Translator's Toolbox:
Technical Specifications

  1. Objective:

    To create a tool which uses technology to expedite translation, yet still allows and requires a human translator to make linguistic decisions.

  2. Problems with Current Translation Methods.

    1. Manual Translation:

      Manual translation is simply too time-consuming. Especially if one is relatively unfamiliar with a given language, one spends too much time flipping through the pages of a dictionary looking up words.

    2. Machine (Babelfish) Translation:

      Although there may be better machine translation programs on the market, there are severe problems with the popular free tool Babelfish. The primary difficulty with Babelfish is that it makes translation choices for the writer. The user enters her text, presses the "Translate" button, and Babelfish chooses dictionary equivalents for each word. This often results in extremely awkward translations.

  3. Proposed Solution:

    1. Automate Dictionary Searches

      This software will expedite translation by automating the process of look-ups, giving the user a list of possible dictionary equivalents for each word. It will then allow the user to pick from this list of possible equivalents and save her choices.

    2. Produce Rough Drafts, not Final Ones

      This program intends to assist the translator in her work. It will *not* produce perfect, finished translations. Using this program will result in a first draft of the work: a run through the translation dictionary looking for equivalents. However, many features of this draft will need modification: the syntax will be uneven at best, and there will probably be some ambiguities about word choices. This program assumes and requires a human intelligence to interact with it.

    3. Limit scope of release

      Currently the audience for this work is just me: the tool will aid me in my various translation projects. Eventually, however, I would be interested in preparing it for some kind of general release.

    4. Anticipate change. Prepare for modular accretions

      Development of this project will occur incrementally. We do not think we will find the best design solutions right away. We feel that we will need to modify and add many components to this software. We need to develop a design approach which will anticipate the future addition of modules and significant code changes.

  4. User Interface

    1. Step One: Input Text




      Notes:

      1. The user enters her text into the text area.
      2. The user can load a file into the text area using the "Load" button.
      3. The user clears the text area by pressing the "Clear" button.
      4. When the user hits the "Translate" button, we go to Step Two of the translation process. (A rough sketch of this screen follows these notes.)
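
      As a very rough sketch of this input screen, something like the following Swing code would do; the class name, layout, and widget handling here are placeholder assumptions of ours, not a settled design.

      import javax.swing.*;
      import java.awt.*;
      import java.io.IOException;
      import java.nio.file.Files;

      // Hypothetical sketch of the Step One screen: a text area with
      // "Load", "Clear", and "Translate" buttons, as described in the notes above.
      public class InputScreen {
          public static void main(String[] args) {
              JFrame frame = new JFrame("Translator's Toolbox - Input Text");
              JTextArea textArea = new JTextArea(20, 80);

              JButton load = new JButton("Load");
              JButton clear = new JButton("Clear");
              JButton translate = new JButton("Translate");

              // "Load" fills the text area from a file chosen by the user.
              load.addActionListener(e -> {
                  JFileChooser chooser = new JFileChooser();
                  if (chooser.showOpenDialog(frame) == JFileChooser.APPROVE_OPTION) {
                      try {
                          textArea.setText(Files.readString(chooser.getSelectedFile().toPath()));
                      } catch (IOException ex) {
                          JOptionPane.showMessageDialog(frame, "Could not read file.");
                      }
                  }
              });

              // "Clear" empties the text area.
              clear.addActionListener(e -> textArea.setText(""));

              // "Translate" hands the text off to Step Two (stubbed out here).
              translate.addActionListener(e ->
                  System.out.println("Proceed to Step Two with: " + textArea.getText()));

              JPanel buttons = new JPanel();
              buttons.add(load);
              buttons.add(clear);
              buttons.add(translate);

              frame.add(new JScrollPane(textArea), BorderLayout.CENTER);
              frame.add(buttons, BorderLayout.SOUTH);
              frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
              frame.pack();
              frame.setVisible(true);
          }
      }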

    2. Step Two: Choose Word Translations


      ORIGINAL: Yo quiero pizza
      TRANSLATION:


      Notes:

      1. The spacing of these lines will be tightly controlled. The length of each line of original text will be limited to 80 characters. Using tables, we will place the original word directly above each corresponding translation option list.
      2. The first option on the select will always be the original, untranslated word. This will be the default selection for all words.
      3. The second option, "Enter Word", will open up a dialogue box allowing the user to enter her own word. I am not sure whether we will be implementing this functionality in the initial release of the software, but it should be included in a later revision.
      4. Beneath these two options will be the possible translations found in a dictionary database search. (A small sketch of this option ordering follows these notes.)
      5. The "Save" button will open up the necessary Save Dialogue box. The translation draft, containing all of the translation choices, will be saved. As detailed in Step Three, we will be saving the original untranslated text along with the translation.
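
      As a sketch of how the option list for each word might be assembled in Step Two (the class and method names are placeholders of ours, and the definitions are assumed to come back from the dictionary database search):

      import java.util.ArrayList;
      import java.util.List;

      // Hypothetical helper: build the contents of the select box for one
      // original word, in the order described in the notes above.
      public class OptionListBuilder {

          public static List<String> buildOptions(String originalWord, List<String> definitions) {
              List<String> options = new ArrayList<>();
              options.add(originalWord);       // 1. the untranslated word is the default selection
              options.add("Enter Word...");    // 2. lets the user enter her own word
              options.addAll(definitions);     // 3. the dictionary equivalents follow
              return options;
          }

          public static void main(String[] args) {
              System.out.println(buildOptions("quiero", List.of("desire", "want", "wish")));
              // prints: [quiero, Enter Word..., desire, want, wish]
          }
      }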

    3. Step Three: Confirmation of Draft Saved.

      1. At the end of the process, the user will see a screen with the message:
        Your draft has been saved.
      2. For each line, the original text will be saved in the draft above the translated text:
        Yo quiero pizza
        I want pizza
      3. Because this is a draft, we will provide lots of whitespace for the hapless translator. There will be two lines of whitespace between every original-translated line pair. There will be six lines of whitespace between every paragraph. (A small sketch of these rules follows this list.)
      4. Perhaps, in a future version, we will allow the user to control whether or not she wants the original, untranslated text saved with the translation. Perhaps we will allow the user to control the amount of whitespace. For now, however, this is the design.
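
      A minimal sketch of the draft-format rules above; the class and method names are our own placeholders:

      // Hypothetical sketch of the Step Three output format: each original line
      // sits directly above its translation, with the whitespace rules described above.
      public class DraftWriter {

          // Two blank lines follow every original-translated pair.
          public static String formatPair(String original, String translated) {
              return original + "\n" + translated + "\n\n\n";
          }

          // Six blank lines separate paragraphs.
          public static String formatParagraphBreak() {
              return "\n\n\n\n\n\n";
          }

          public static void main(String[] args) {
              System.out.print(formatPair("Yo quiero pizza", "I want pizza"));
          }
      }
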
  5. Functional Breakdown

    There are many, many features that can and should be added to a piece of translation software. Some of these features should be immediately available as part of the core functionality, but most can be added later as separate modules.

  6. Core functionality:


    1. Words and Definitions


      The core motivation for this project is to give the translator an opportunity to choose a definition for each word, while reducing the time needed to look through a dictionary. We can conceive of the following simple one-to-many map as a model for words and definitions:


      
      WORD                            DEFINITIONS
      ----                           ------------
                                     desire
      quiero   --------------->      want
                                     wish
      
      

      The major problem with commercial translation products is that they try to impose a one-to-one mapping.

      For Example:
      
      WORD                            DEFINITIONS
      ----                           ------------
      quiero   --------------->      want
      
      

      The proper understanding of this mapping suggests a fairly simple object model:

      1. AbstractWord:

        An AbstractWord object is the base class for Word objects and Definition objects. Perhaps it will contain its own Unicode String representation. Perhaps it will contain information about its language. Perhaps it will contain no data at all.

      2. RealWord

        A RealWord subclasses from AbstractWord. As we will learn in the discussion below about Roots and Suffixes, a word might be a complex entity. For example, we might want to know that the word "pizzas" contains both the root "pizza" and the suffix "-s", indicating a plurality. A RealWord will be able to store this kind of information. Additionally, "RealWord" probably does not refer to only one object class, but to a set of object classes.

      3. Verb

        A Verb also is a subclass of AbstractWord. It is, in fact, a kind of RealWord. With verbs, we have to store information such as the tense, the voice, and the person of the verb.

      4. Further Analysis

        See a more fully developed analysis of Words in the following document: A Closer Analysis of Words

      5. Definition

        A Definition object is also a subclass of AbstractWord. A Definition wraps each element retrieved from a dictionary database call.

      6. DefinitionList

        A DefinitionList is a collection of Definition objects. In the simplest case, the database is queried with a Word object ("quiero") and returns a DefinitionList object (["desire", "want", "wish"]). A skeletal code sketch of these classes follows below.
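
      A bare-bones code sketch of this object model follows. The fields and constructors are assumptions for illustration only; in particular, the hard-coded language tags and the exact inheritance arrangement are not settled design decisions.

      import java.util.List;

      // Base class for everything word-like. It may end up holding a Unicode
      // string and a language tag, or perhaps no data at all.
      abstract class AbstractWord {
          protected String text;
          protected String language;

          AbstractWord(String text, String language) {
              this.text = text;
              this.language = language;
          }
      }

      // A word taken from the source text, possibly decomposed into a root
      // plus suffixes (e.g. "pizzas" = root "pizza" + suffix "-s").
      class RealWord extends AbstractWord {
          String root;
          List<String> suffixes;

          RealWord(String text, String root, List<String> suffixes) {
              super(text, "es");
              this.root = root;
              this.suffixes = suffixes;
          }
      }

      // A Verb is a kind of RealWord that also records tense, voice, and person.
      class Verb extends RealWord {
          String tense;
          String voice;
          String person;

          Verb(String text, String root, List<String> suffixes,
               String tense, String voice, String person) {
              super(text, root, suffixes);
              this.tense = tense;
              this.voice = voice;
              this.person = person;
          }
      }

      // Wraps one element retrieved from a dictionary database call.
      class Definition extends AbstractWord {
          Definition(String text) {
              super(text, "en");
          }
      }

      // A simple collection of Definition objects, e.g. the result of
      // querying the database with "quiero".
      class DefinitionList {
          List<Definition> definitions;

          DefinitionList(List<Definition> definitions) {
              this.definitions = definitions;
          }
      }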

    2. Roots and Suffixes

      Unfortunately, we rarely actually encounter simple words. Instead, what we tend to find are words which are composed of roots and prefixes and/or suffixes. Before we make a call to the database, we need to decompose a word into these parts.

      This program requires us to write a fairly extensive parsing engine to decompose words. We would probably use a tool like LEX to create a parse tree. We would actually hope that somebody else has already created a parsing engine. {Where might we look to find one? }

      For example: "Damelo" must be decomposed into the root "Da" --> "give", the suffix "me" --> "to me", and the suffix "lo" --> "it".

      Luckily, in Spanish, we only have to worry about suffixes.

      Probably the simplest way to decompose words would be for the program to scan each word for suffixes. For example, if we see that a word ends with "-me" and "-lo", we will split off these two components, save them, and reduce the word to its root, "da".

      We are hoping that there are only a fairly limited number of common prefixes and suffixes, such as those which indicate direct and indirect objects, and those which indicate pluralities.

      In some cases, however, it may be better to simply store multiple roots in the parsing engine rather than decomposing each word. For example, the Spanish word for "red" is "rojo" for a masculine object and is "roja" for a feminine object. We could store the root "roj" in the database, and consider "-o" and "-a" as suffixes. This topic and these tradeoffs still must be considered carefully.

      Already in this example, there are some ambiguities about the meanings of these suffixes. The suffix "-me" can signify either the direct object or the indirect object. One distinguishes the meaning of this suffix by its syntactical placement in a sentence. However, for this initial release, we will not be bothering with deducing syntax. The human translator will have to deduce the proper meaning herself.
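
      A first, deliberately naive sketch of this suffix scan follows; the suffix list is only the handful mentioned above, and the names are placeholders. A real implementation would hand this job to the LEX-style parsing engine discussed earlier.

      import java.util.ArrayList;
      import java.util.List;

      // Hypothetical, naive suffix scanner: repeatedly strip known suffixes
      // off the end of a word, leaving a candidate root.
      public class SuffixScanner {

          // A few of the common object and plural suffixes discussed above.
          private static final String[] SUFFIXES = { "lo", "me", "s" };

          public static List<String> decompose(String word) {
              List<String> parts = new ArrayList<>();
              String remaining = word.toLowerCase();

              boolean stripped = true;
              while (stripped) {
                  stripped = false;
                  for (String suffix : SUFFIXES) {
                      if (remaining.length() > suffix.length() && remaining.endsWith(suffix)) {
                          parts.add(0, "-" + suffix);   // record the suffix we split off
                          remaining = remaining.substring(0, remaining.length() - suffix.length());
                          stripped = true;
                          break;
                      }
                  }
              }
              parts.add(0, remaining);                  // whatever is left is the root
              return parts;
          }

          public static void main(String[] args) {
              System.out.println(decompose("damelo"));   // [da, -me, -lo]
              System.out.println(decompose("pizzas"));   // [pizza, -s]
          }
      }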

    3. VERBS: Conjugations and Infinitives

      Conjugating verbs dramatically complicates the problem of roots, prefixes and suffixes.

      In the example above of "Damelo", we are left with the root "Da" after we remove the suffixes "me" and "lo". However, it would probably be foolish if we used "Da" as our query term in our Definition database. The verb "Da", "give", is the second-person-singular imperative form of the verbal infinitive "dar", "to give".

      It is the infinitive, this most abstract representation of the verb, that we should use in our queries.

      1. Two Levels of Indirection

        We find that verbs require two levels of indirection.

        Most words require only one level of indirection:

        
        WORD                            DEFINITION_LIST
        ----                            ---------------
        
                  Database Lookup
        casa    --------------------->   house
                                         home
                                         business
        
        
        

        Verbs, however, require two levels of indirection:

        
        WORD           INFINITIVE        DEFINITION_LIST
        -----         ------------       ----------------
        
        da   ------->    dar       -----> to give
                                          to strike
                                          to emit
        
        
      2. Storing Verbal Forms

        The additional level of indirection for verbs is quite complicated. From a given conjugation, we must find the infinitive.

        In addition, while crossing from the conjugation to the infinitive, we need to store the particular form of the conjugation. For example, we need to record that "da" is the second-person singular imperative form. We propose that we do not try to conjugate the definition, but merely indicate the form.

        That is, we should *not* do the following:


        
        WORD                            DEFINITIONS
        ----                           ------------
                                       give
        da       --------------->      strike
                                       emit
        
        

        Instead, we should do the following:

        
        WORD                            DEFINITIONS
        ----                           ------------
                                       to give  [2, S, I]
        da       --------------->      to strike [2, S, I]
                                       to emit   [2, S, I]
        
        

        Where [2, S, I] indicates second person, singular, imperative.
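
        As a sketch of what such a stored lookup might carry (the field names and the bracketed form tag are placeholders of our own choosing):

        import java.util.List;

        // Hypothetical record of a verb lookup: the conjugated form found in the
        // text, the infinitive we actually query on, the form tag (e.g. "[2, S, I]"
        // for second person, singular, imperative), and the definitions returned.
        public class VerbLookup {
            final String conjugatedForm;    // "da"
            final String infinitive;        // "dar"
            final String formTag;           // "[2, S, I]"
            final List<String> definitions; // ["to give", "to strike", "to emit"]

            VerbLookup(String conjugatedForm, String infinitive,
                       String formTag, List<String> definitions) {
                this.conjugatedForm = conjugatedForm;
                this.infinitive = infinitive;
                this.formTag = formTag;
                this.definitions = definitions;
            }

            // Present each definition with the form tag attached, as in the table above.
            public void print() {
                for (String d : definitions) {
                    System.out.println(d + "  " + formTag);
                }
            }

            public static void main(String[] args) {
                new VerbLookup("da", "dar", "[2, S, I]",
                        List.of("to give", "to strike", "to emit")).print();
            }
        }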

      3. Finding Infinitives: Regular and Irregular Verbs

        Determining the proper way to deduce infinitives from particular verbal forms raises many of the same problems we encountered dealing with prefixes and suffixes.

        In some cases, we can rely on the regularity of conjugations to transform a given verbal form into its infinitive. In other cases, we must store multiple roots. In still other cases, the conjugations are so irregular that we should probably store each verbal form directly in the database.

        1. Regular Verbs

          Regular verbs have definite ways of expressing their tenses. For example, all regular verbal infinitives that end in -ar, such as "hablar", "to speak", have regular forms for the present tense conjugation, ending with the suffixes "-o", "-as", "-a", "-amos", "-ais", "-an".

          Thanks to this predictable regularity, we can use our typical method of scanning for suffixes and deducing roots to find infinitives. As we see, our program will have to do a lot of scanning, because we will have to check every word that ends with an "a" or an "o", et cetera. Oh well.
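
          A sketch of this scan for regular -ar verbs in the present tense (the names are placeholders, and the ending list is only the one given above):

          import java.util.Optional;

          // Hypothetical scan: if a word ends with one of the regular present-tense
          // endings for -ar verbs, strip the ending and propose root + "ar" as the
          // infinitive to use in the dictionary query.
          public class RegularArVerbs {

              // Longest endings first, so that "-amos" is not mistaken for a bare "-a".
              private static final String[] PRESENT_ENDINGS = { "amos", "ais", "an", "as", "o", "a" };

              public static Optional<String> deduceInfinitive(String word) {
                  String w = word.toLowerCase();
                  for (String ending : PRESENT_ENDINGS) {
                      if (w.length() > ending.length() && w.endsWith(ending)) {
                          return Optional.of(w.substring(0, w.length() - ending.length()) + "ar");
                      }
                  }
                  return Optional.empty();
              }

              public static void main(String[] args) {
                  System.out.println(deduceInfinitive("hablo"));    // Optional[hablar]
                  System.out.println(deduceInfinitive("hablamos")); // Optional[hablar]
              }
          }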

        2. Orthographic Changing Verbs

          For certain Spanish verbs, the last few letters of the root alter in determinable ways in order to preserve the final sound. For example, in verbs which end with "-zar", the "z" changes to a "c" before an "e". Thus, with the verb "abrazar" [to embrace], one writes "yo abracé", "I embraced", and "él abrazó", "he embraced".

          There are two possible ways of dealing with these orthographic irregularities:

          1. We can attempt to store multiple roots in the database, so that we recognize that the roots "abrac-" and "abraz-" both indicate the infinitive "abrazar".
          2. Alternatively, we can add another set of suffixes to our scanning tree. That is, we can treat "abra-" as the root and treat "-cé" and "-zó" and so on as our suffixes.
          3. The first alternative is probably the better one.
        3. Radical Changing Verbs

          In certain cases, the roots of the verbs change in predictable ways. For example, with the verb "contar", "to count", the "o" sometimes changes to "ue", so that one writes "yo cuento" "I count", and "nosotros contamos" "we count".

          We will choose to store multiple roots in the database, so that we will recognize both "cont-" and "cuent-" as indicating the infinitive "contar".

        4. Irregular Verbs

          In most cases, irregular verbs should be stored directly in the database. A verb like "ser" (to be) occurs so frequently, and has so little regularity, that it cannot be efficiently decomposed by a parsing engine. Instead of bothering to find suffixes and roots, we will merely store all forms of the verb ("soy", "eres", "es", "somos", etc.) as pointing to the infinitive "ser".

          Where we can find sufficient patterns of regularity in other verbs, we may try to write certain decomposition rules for separating out suffixes and roots, but, in general, we will store all forms of irregular verbs.
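
          One way to sketch the lookup table implied by the last three subsections: alternate roots (from orthographic and radical changes) and whole irregular forms all point straight at an infinitive. The table below contains only the examples from this section.

          import java.util.Map;
          import java.util.Optional;

          // Hypothetical lookup table mapping stored roots and irregular forms
          // to the infinitive we will use in the dictionary query.
          public class InfinitiveTable {

              private static final Map<String, String> ROOTS_AND_FORMS = Map.ofEntries(
                      Map.entry("abraz", "abrazar"),   // regular root
                      Map.entry("abrac", "abrazar"),   // orthographic change (z -> c before e)
                      Map.entry("cont",  "contar"),    // regular root
                      Map.entry("cuent", "contar"),    // radical change (o -> ue)
                      Map.entry("soy",   "ser"),       // irregular forms stored whole
                      Map.entry("eres",  "ser"),
                      Map.entry("es",    "ser"),
                      Map.entry("somos", "ser"));

              public static Optional<String> infinitiveFor(String rootOrForm) {
                  return Optional.ofNullable(ROOTS_AND_FORMS.get(rootOrForm.toLowerCase()));
              }

              public static void main(String[] args) {
                  System.out.println(infinitiveFor("cuent")); // Optional[contar]
                  System.out.println(infinitiveFor("somos")); // Optional[ser]
              }
          }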

  7. Workflow Analysis

    Through the writing of this specification, we have gained a greater appreciation of the complexity of our problem. In this section, we will reconsider our design.

    We need a set of objects to control the user interface of the program. We have written user interfaces before, but we are not sure whether we remember how best to decouple the UI from the functionality of the translator. Any suggestions for reading would be appreciated. To translate a piece of text, we split the text into a series of word-tokens.
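
    As a sketch of that first tokenizing step (the class name and the punctuation handling here are assumptions of ours):

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical first cut at tokenizing: split the text on whitespace and
    // strip surrounding punctuation, leaving bare word-tokens for the parser.
    public class Tokenizer {

        public static List<String> tokenize(String text) {
            return Arrays.stream(text.split("\\s+"))
                    .map(t -> t.replaceAll("^[\\p{Punct}¿¡]+|[\\p{Punct}¿¡]+$", ""))
                    .filter(t -> !t.isEmpty())
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            System.out.println(tokenize("Yo quiero pizza. ¿Quieres pizza?"));
            // [Yo, quiero, pizza, Quieres, pizza]
        }
    }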

    A RealWord is not a simple base object, but a collection of objects. If it is a verb, for example, it will have to record its tense, its person, and so on. If it is a noun, it will have to record its singularity or plurality and so on. I do remember that this is one of the typical design patterns, but I can't remember which one. If anyone can point me to some reading, please do so.

    It would seem that we first need to write a pretty large LEX parser to analyze each token. We could perhaps try to do something with PERL hashes, but LEX would probably be more efficient. Only then will we want to make a database call to find a definition.

    Once we have correctly identified the tokens, and made our database calls, setting up the select boxes, the save buttons, and so on will be pretty simple. We have done it before.

    The central problem of this work is: how do we build our LEX parser and our dictionary database? Has this been done before? Is there any freeware out there? Where would we look for it?

    We are still not entirely sure what language to write this program in. Professor Abbas Moghtanei has recommended Java as the best choice. Java does seem like the easiest language for writing user interfaces. However, we don't know whether we might need some of the flexibility of PERL for parsing tokens. If the LEXer will take care of all the parsing, maybe we don't have to worry about this at all. We also worry that certain users might not have a Java Virtual Machine on their computer, and so we might be forced to rewrite everything in C++ in the future. We don't look forward to this.

  8. Additional Modules

    Here are some ideas for future developments of this software.

    1. Modular Extensibility

      We will be developing this software in fits and starts. Our intention is not to make a perfect piece of software from the start, but to make a program to which we can easily add modules which will extend its functionality. We must design this software with this in mind.

    2. Add Memory of Past Choices

      A DefinitionList could eventually evolve a memory for the choices the user usually makes. Definitions would be sorted by frequency and recency, so that the definition chosen most frequently and most recently would appear at the top of the list. Each Definition in a DefinitionList would then be given a weight:


      WEIGHT = A (Number_Of_Times_Chosen) X B (Current_Time - Last_Time_Chosen)

      (A and B are constants)

      The Translator's Toolbox would have to be modified in the following ways: the database would have to store the frequency and date of each choice, and we would have to write an algorithm to calculate weights and sort DefinitionLists on that basis.
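
      A rough sketch of the weighting and sorting follows. Note that this takes one possible reading of the formula above: the elapsed-time term is treated as a penalty and subtracted, so that frequent and recent choices float to the top of the list. The constants, and the exact way the two terms are combined, remain to be decided.

      import java.util.ArrayList;
      import java.util.Comparator;
      import java.util.List;

      // Hypothetical weighted definition for the "memory of past choices" module.
      public class WeightedDefinition {
          final String definition;
          final int timesChosen;
          final long lastChosenEpochSeconds;

          WeightedDefinition(String definition, int timesChosen, long lastChosenEpochSeconds) {
              this.definition = definition;
              this.timesChosen = timesChosen;
              this.lastChosenEpochSeconds = lastChosenEpochSeconds;
          }

          // One reading of WEIGHT = A(Number_Of_Times_Chosen) X B(Current_Time - Last_Time_Chosen):
          // here the elapsed-time term is subtracted as a penalty so that definitions chosen
          // often and recently receive the largest weights. A and B would have to be tuned.
          static double weight(WeightedDefinition d, long nowEpochSeconds, double a, double b) {
              return a * d.timesChosen - b * (nowEpochSeconds - d.lastChosenEpochSeconds);
          }

          public static void main(String[] args) {
              long now = System.currentTimeMillis() / 1000;
              List<WeightedDefinition> list = new ArrayList<>(List.of(
                      new WeightedDefinition("desire", 1, now - 86_400),
                      new WeightedDefinition("want",   7, now - 3_600),
                      new WeightedDefinition("wish",   3, now - 600)));

              // Sort so that the highest-weighted definition appears first.
              list.sort(Comparator.comparingDouble(
                      (WeightedDefinition d) -> weight(d, now, 1000.0, 0.01)).reversed());

              list.forEach(d -> System.out.println(d.definition));   // want, wish, desire
          }
      }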

    3. Idiomatic phrases:

      Currently, the program can only analyze text on a word-for-word basis. It would be nice to allow the program to identify idiomatic phrases. For example, the program should eventually be able to identify the phrase "dar de comer" as "to feed" instead of "to give [dar] of [de] to eat [comer]". At this point, we are not going to work on a solution for this issue, but we are aware that it is a serious one. We will have to rely on the intelligence of the translator to handle these problems.

    4. Syntax fixer

      This program makes no attempts whatsoever to help the translator correct the syntax of the translated text. Perhaps, at a later date, we would want to do this. In such a case, we would want to store grammatical information in each RealWord, expressing whether it is a verb or a noun or an adjective, and so on. It is unclear whether the syntax correction must occur simultaneously with the dictionary look ups. Perhaps the syntax can be fixed after the definitions have been found? In any case, this is an incredibly large undertaking, and I will certainly not be doing anything on it soon.

    5. Storing alternate choices to database

      As mentioned above, the user can choose alternate definitions for each word. The user should similarly be allowed to save these alternate definitions to the database. This seems like a pretty important feature and should be added soon.

    6. Extend Software to Other Languages and Other Users

      In the initial release, the primary audience will be Mitch, who will be translating Spanish to English. However, we will need to extend the usage of this software to other languages. We must make sure that this modularity is built in from the start: that a user can just plug in a LEXer and a database for Romanian and Serbo-Croatian, for example, and translate from Romanian to Serbo-Croatian.
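
      A sketch of the kind of plug-in seam this implies (the interface and method names are placeholders only):

      import java.util.List;

      // Hypothetical plug-in seam: a language pair is a word parser plus a
      // dictionary database. Adding a Romanian-to-Serbo-Croatian module would
      // mean supplying one implementation of each, without touching the rest
      // of the program.
      interface WordParser {
          // Decompose a surface word into a root followed by its suffixes.
          List<String> decompose(String word);
      }

      interface DictionaryDatabase {
          // Return possible target-language equivalents for a root or infinitive.
          List<String> lookUp(String rootOrInfinitive);
      }

      class LanguageModule {
          final String sourceLanguage;
          final String targetLanguage;
          final WordParser parser;
          final DictionaryDatabase dictionary;

          LanguageModule(String sourceLanguage, String targetLanguage,
                         WordParser parser, DictionaryDatabase dictionary) {
              this.sourceLanguage = sourceLanguage;
              this.targetLanguage = targetLanguage;
              this.parser = parser;
              this.dictionary = dictionary;
          }
      }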

    7. Other Stuff

      I had some other features in mind, but I don't have my initial notebook here. In any case, they are not centrally important to the core functionality.