To create a tool which uses technology to expedite translation, yet still allows and requires a human translator to make linguistic decisions.
Manual translation is simply too time-consuming. Especially if one is relatively unfamiliar with a given language, one spends too much time flipping through the pages of a dictionary looking up words.
Although there may be better machine translation programs on the market, there are severe problems with the popular free tool Babelfish. The primary difficulty with Babelfish is that it makes translation choices for the writer. The user enters her text, presses the "Translate" button, and Babelfish chooses dictionary equivalents for each word. This often results in extremely awkward translations.
This software will expedite translation by automating the process of look-ups, giving the user a list of possible dictionary equivalents for each word. It will then allow the user to pick from this list of possible equivalents and save her choices.
This program intends to assist the translator in her work. It will *not* produce perfect, finished translations. Using this program will result in a first draft of the work: a run through the translation dictionary looking for equivalencies. However, many features of this draft will need modification: the syntax will be uneven at best, and there will probably be some ambiguities about word choices. This program assumes and requires a human intelligence to interact with it.
Currently the audience for this work is myself, in order to aid me in my various translation projects, but I would be interested in eventually preparing it for some kind of general release.
Development of this project will occur incrementally. We do not think we will find the best design solutions right away. We feel that we will need to modify and add many components to this software. We need to develop a design approach which will anticipate the future addition of modules and significant code changes.
Yo quiero pizza
I want pizza
There are many, many features that can and should be added to a piece of translation software. Some of these features should be immediately available as part of the core functionality, but most of them can be added later as separate modules.
The core motivation for this project is to give the translator an opportunity to choose a definition for each word, while reducing the time needed to look through a dictionary. We can conceive of the following simple one-to-many map as a model for words and definitions:
WORD             DEFINITIONS
----             -----------
                 desire
quiero --------> want
                 wish
The major problem with commercial translation products is that they try to impose a one-to-one mapping.
For example:

WORD             DEFINITIONS
----             -----------
quiero --------> want
The proper understanding of this mapping suggests a fairly simple object model:
An AbstractWord object is the base class for Word objects and Definition objects. Perhaps it will contain its own Unicode String representation. Perhaps it will contain information about its language. Perhaps it will contain no data at all.
A RealWord subclasses from AbstractWord. As we will learn in the discussion below about Roots and Suffixes, a word might be a complex entity. For example, we might want to know that the word "pizzas" contains both the root "pizza" and the suffix "-s", indicating a plurality. A RealWord will be able to store this kind of information. Additionally, "RealWord" probably does not refer to only one object class, but to a set of object classes.
A Verb is also a subclass of AbstractWord; it is, in fact, a kind of RealWord. With verbs, we have to store information such as the tense, the voice, and the person of the verb.
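A rough Python sketch of this object hierarchy follows. The attribute names and constructor signatures are our illustrative assumptions, not settled design.

```python
# A sketch of the proposed object hierarchy. Attribute names and
# constructor signatures are illustrative assumptions, not settled design.

class AbstractWord:
    """Base class for Word and Definition objects."""
    def __init__(self, text, language=None):
        self.text = text          # the word's Unicode string representation
        self.language = language  # e.g. "es" or "en"; may turn out unnecessary

class RealWord(AbstractWord):
    """A word from the source text, possibly decomposed into parts."""
    def __init__(self, text, root=None, suffixes=None, language=None):
        super().__init__(text, language)
        self.root = root if root is not None else text  # "pizza" for "pizzas"
        self.suffixes = suffixes or []                  # e.g. ["-s"] (plural)

class Verb(RealWord):
    """A kind of RealWord that also records its inflection."""
    def __init__(self, text, tense=None, voice=None, person=None, **kwargs):
        super().__init__(text, **kwargs)
        self.tense, self.voice, self.person = tense, voice, person

class Definition(AbstractWord):
    """Wraps one element retrieved from a dictionary database call."""

pizzas = RealWord("pizzas", root="pizza", suffixes=["-s"])
```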
See a more fully developed analysis of Words in the following document: A Closer Analysis of Words
A Definition object is also a subclass of AbstractWord. A definition wraps each element retrieved from a dictionary database call.
A DefinitionList is a collection of Definition objects. In the simplest case, the database is queried with a Word object ("quiero") and returns a DefinitionList object (["desire", "want", "wish"]).
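The simple one-to-many case in miniature might look like the following, where an in-memory dict stands in for the real dictionary database (an assumption for now):

```python
# The simple one-to-many case in miniature. The in-memory dict is a
# stand-in for the real dictionary database (an assumption for now).
DICTIONARY_DB = {
    "quiero": ["desire", "want", "wish"],
}

def lookup(word):
    """Query the 'database' with a word; return its DefinitionList."""
    return list(DICTIONARY_DB.get(word, []))

lookup("quiero")  # -> ["desire", "want", "wish"]
```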
Unfortunately, we rarely actually encounter simple words. Instead, what we tend to find are words which are composed of roots and prefixes and/or suffixes. Before we make a call to the database, we need to decompose a word into these parts.
This program requires us to write a fairly extensive parsing engine to decompose words. We would probably use a tool like LEX to create a parse tree. We would actually hope that somebody else has already created a parsing engine. {Where might we look to find one? }
For example, "Damelo" must be decomposed into the root "Da" --> "give", the suffix "me" --> "to me", and the suffix "lo" --> "it".
Luckily, in Spanish, we only have to worry about suffixes.
Probably, the simplest way to decompose words would be for the program to scan each word for suffixes. For example, if a word ends with "-me" and "-lo", we will split off these two components, save them, and reduce the word to its root ("da", in the example above).
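A sketch of this suffix scanning follows. The suffix inventory is tiny and illustrative; a real engine would also have to verify the remaining root against the dictionary before accepting a decomposition.

```python
# A sketch of suffix scanning. The suffix inventory is tiny and
# illustrative; a real engine would also verify the remaining root
# against the dictionary before accepting a decomposition.
KNOWN_SUFFIXES = ["me", "lo", "la", "s"]

def decompose(word):
    """Repeatedly strip known suffixes from the end of the word.
    Returns (root, suffixes-in-original-order)."""
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in KNOWN_SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                suffixes.insert(0, suffix)
                stripped = True
                break
    return word, suffixes

decompose("damelo")  # -> ("da", ["me", "lo"])
```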
We are hoping that there are only a fairly limited number of common prefixes and suffixes, such as those which indicate direct and indirect objects, and those which indicate pluralities.
In some cases, however, it may be better to simply store multiple roots in the parsing engine rather than decomposing each word. For example, the Spanish word for "red" is "rojo" for a masculine object and is "roja" for a feminine object. We could store the root "roj" in the database, and consider "-o" and "-a" as suffixes. This topic and these tradeoffs still must be considered carefully.
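A sketch of the multiple-roots alternative, assuming the database keys stems without their gender endings:

```python
# A sketch of the multiple-roots alternative: the database stores the
# stem "roj", and "-o"/"-a" are treated as gender suffixes. Keying the
# database by stems like this is the assumption under discussion.
ROOT_DB = {"roj": "red"}

def lookup_adjective(word):
    """Try the whole word first, then try stripping a gender ending."""
    if word in ROOT_DB:
        return ROOT_DB[word], None
    if word and word[-1] in ("o", "a") and word[:-1] in ROOT_DB:
        return ROOT_DB[word[:-1]], "-" + word[-1]
    return None, None

lookup_adjective("rojo")  # -> ("red", "-o")
```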
Already in this example, there are some ambiguities about the meanings of these suffixes. The suffix "me" can signify either the direct object or the indirect object. One distinguishes the meaning of this suffix by its syntactical placement in a sentence. However, for this initial release, we will not be bothering with deducing syntax. The human translator will have to deduce the proper meaning herself.
Conjugating verbs dramatically complicates the problem of roots, prefixes and suffixes.
In the example above of "Damelo", we are left with the root "Da" after we remove the suffixes "me" and "lo". However, it would probably be foolish if we used "Da" as our query term in our Definition database. The verb "Da", "give", is the second-person-singular imperative form of the verbal infinitive "dar", "to give".
It is the infinitive, this most abstract representation of the verb, that we should use in our queries.
We find that verbs require two levels of indirection.
Most words require only one level of indirection:

WORD                         DEFINITION_LIST
----                         ---------------
         Database Lookup
casa ----------------------> house
                             home
                             business
Verbs, however, require two levels of indirection:

WORD        INFINITIVE        DEFINITION_LIST
----        ----------        ---------------
                              to give
da -------> dar ------------> to strike
                              to emit
The additional level of indirection for verbs is quite complicated. From a given conjugation, we must find the root.
In addition, while crossing from the conjugation to the root, we need to store the particular form of the conjugation. For example, we need to record that "da" is the second person singular imperative form. We would propose that we do not try to conjugate the definition, but merely indicate the form.
That is, we should *not* do the following:

WORD             DEFINITIONS
----             -----------
                 give
da ------------> strike
                 emit
Instead, we should do the following:

WORD             DEFINITIONS
----             -----------
                 to give   [2, S, I]
da ------------> to strike [2, S, I]
                 to emit   [2, S, I]
Where [2, S, I] indicates second person, singular, imperative.
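The two levels of indirection might be sketched as follows. Both tables stand in for the real database, and the form tag is the [person, number, mood] triple described above.

```python
# A sketch of the two levels of indirection. Both tables stand in for
# the real database; the form tag is the [person, number, mood] triple
# described above.
FORM_TO_INFINITIVE = {
    "da": ("dar", (2, "S", "I")),  # second person, singular, imperative
}
INFINITIVE_DB = {
    "dar": ["to give", "to strike", "to emit"],
}

def lookup_verb(form):
    """Resolve a conjugated form to its infinitive, then attach the
    form tag to each definition rather than conjugating it."""
    infinitive, tag = FORM_TO_INFINITIVE[form]
    return [(definition, tag) for definition in INFINITIVE_DB[infinitive]]
```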
Determining the proper way to deduce infinitives from particular verbal forms repeats many of the same problems we encountered dealing with prefixes and suffixes.
In certain cases, we can rely on the regularity of conjugations to transform a given verbal form to its infinitive. In other cases, we must store multiple roots. And in still other cases, the conjugations are so irregular that we should probably store each verbal form directly in the database.
Regular verbs have definite ways of expressing their tenses. For example, all regular verbal infinitives that end in -ar, such as "hablar", "to speak", have regular forms for the present tense conjugation, ending with the suffixes "-o", "-as", "-a", "-amos", "-ais", "-an".
Thanks to this predictable regularity, we can use our typical method of scanning for suffixes and deducing roots to find infinitives. As we see, our program will have to do a lot of scanning, because we will have to check every word that ends with an "a" or an "o", et cetera. Oh well.
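A sketch of deducing an -ar infinitive from a regular present-tense form: the endings are checked longest-first, and the resulting candidate would still have to be confirmed against the dictionary, since a noun like "casa" would otherwise yield a spurious "casar".

```python
# A sketch of deducing an -ar infinitive from a regular present-tense
# form. Endings are checked longest-first; the resulting candidate
# would still have to be confirmed against the dictionary, since a
# noun like "casa" would otherwise yield a spurious "casar".
AR_PRESENT_ENDINGS = ["amos", "ais", "as", "an", "a", "o"]

def infinitive_of(form):
    for ending in AR_PRESENT_ENDINGS:
        if form.endswith(ending) and len(form) > len(ending):
            return form[: -len(ending)] + "ar"
    return None

infinitive_of("hablamos")  # -> "hablar"
```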
For certain Spanish verbs, the last few letters alter in determinable ways in order to preserve the final sound. For example, in verbs which end with "-zar", the "z" changes to a "c" before an "e". Thus, with the verb "abrazar" [to embrace], one writes "yo abracé", "I embraced", and "él abrazó", "he embraced".
There are two possible ways of dealing with these orthographic irregularities: we can encode the spelling-change rules in the parsing engine, or we can store both spellings of the root in the database.
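Encoded as a rule in the parsing engine, the z-to-c change might look like this. The function is only a sketch and assumes we already suspect we are handling a -zar verb form.

```python
# A sketch of one parsing-engine rule: undoing the z -> c spelling
# change when recovering a -zar stem. Assumes we already suspect we
# are handling a -zar verb form.
def restore_zar_stem(stem):
    """Undo the z -> c spelling change in a suspected -zar stem."""
    if stem.endswith("c"):
        return stem[:-1] + "z"
    return stem

restore_zar_stem("abrac") + "ar"  # -> "abrazar"
```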
In certain cases, the roots of the verbs change in predictable ways. For example, with the verb "contar", "to count", the "o" sometimes changes to "ue", so that one writes "yo cuento", "I count", and "nosotros contamos", "we count".
We will choose to store multiple roots in the database, so that we will recognize both "cont-" and "cuent-" as indicating the infinitive "contar".
In most cases, irregular verbs should be stored directly in the database. A verb like "ser" (to be) occurs so frequently and has so little regularity that it cannot be efficiently decomposed by a parsing engine. Instead of bothering to find suffixes and roots, we will merely store all forms of the verb ("soy", "eres", "es", "somos", etc.) as pointing to the infinitive "ser".
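Storing irregular forms directly amounts to a simple table from each form to its infinitive, with no decomposition attempted:

```python
# A sketch of direct storage for irregular verbs: every present-tense
# form of "ser" simply points at the infinitive, with no decomposition.
IRREGULAR_FORMS = {
    "soy": "ser", "eres": "ser", "es": "ser",
    "somos": "ser", "sois": "ser", "son": "ser",
}

def infinitive_of_irregular(form):
    return IRREGULAR_FORMS.get(form)

infinitive_of_irregular("somos")  # -> "ser"
```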
Where we can find sufficient patterns of regularity in other verbs, we may try to write certain decomposition rules for separating out suffixes and roots, but, in general, we will store all forms of irregular verbs.
Through the writing of this specification, we have gained a greater appreciation of the complexity of our problem. In this section, we will reconsider our design.
We need a set of objects to control the user interface of the program. We have written user interfaces before, but we are not sure whether we remember how best to decouple the UI from the functionality of the translator. Any suggestions for reading would be appreciated. To translate a piece of text, we split the text into a series of word-tokens.
A RealWord is not a simple base object, but a collection of objects. If it is a verb, for example, it will have to record its tense, its person, and so on. If it is a noun, it will have to record its singularity or plurality and so on. I do remember that this is one of the typical design patterns, but I can't remember which one. If anyone can point me to some reading, please do so.
It would seem that we first need to write a pretty large LEX parser to analyze each token. We could perhaps try to do something with PERL hashes, but LEX would probably be more efficient. Only then will we want to make a database call to find a definition.
Once we have correctly identified the tokens, and made our database calls, setting up the select boxes, the save buttons, and so on will be pretty simple. We have done it before.
The central problem of this work is: how do we build our LEX parser and our dictionary database? Has this been done before? Is there any freeware out there? Where would we look for it?
We are still not entirely sure about what language to write this program in. Professor Abbas Moghtanei has recommended Java as the best choice. Java does seem like the easiest language for writing user interfaces. However, we don't know whether we might need some of the flexibility of PERL for parsing tokens. If the LEXer will take care of all the parsing, maybe we don't have to worry about this at all. We also worry that certain users might not have a Java Virtual Machine on their computer, and so we might be forced to rewrite everything in C++ in the future. We don't look forward to this.
We will be developing this software in fits and starts. Our intention is not to make a perfect piece of software from the start, but to make a program to which we can easily add modules which will extend its functionality. We must design this software with this in mind.
A DefinitionList could eventually evolve a memory for the choices the user usually makes. Definitions would be sorted by frequency and recency, so that the definition chosen most frequently and most recently would appear at the top of the list. Each Definition in a DefinitionList would then be given a weight:
WEIGHT = A (Number_Of_Times_Chosen) X B (Current_Time - Last_Time_Chosen)
(A and B are constants)
The Translator's workbench would have to be modified in the following ways: the database would have to store the frequency and date of choices, and we would have to write an algorithm to calculate weights and sort DefinitionLists on that basis.
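A sketch of the weighting and sorting, with one assumption made explicit: for recent choices to rank higher, the recency term must lower the weight as the time since the last choice grows, so it is subtracted here. The values of A and B are arbitrary.

```python
# A sketch of weighting and sorting a DefinitionList. One assumption:
# the recency term should *lower* a definition's weight as the time
# since it was last chosen grows, so it is subtracted here. A and B
# are arbitrary tuning constants.
A, B = 1.0, 0.001

def weight(times_chosen, now, last_chosen):
    return A * times_chosen - B * (now - last_chosen)

def sort_definitions(definitions, now):
    """definitions: list of (text, times_chosen, last_chosen_time)."""
    return sorted(definitions,
                  key=lambda d: weight(d[1], now, d[2]),
                  reverse=True)
```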
Currently, the program can only analyze text on a word-for-word basis. It would be nice to allow the program to identify idiomatic phrases. For example, the program should eventually be able to identify the phrase "dar de comer" as "to feed" instead of "to give [dar] of [de] to feed [comer]". At this point, we are not going to work on a solution for this issue, but we are aware that it is a serious one. We will have to rely on the intelligence of the translator to handle these problems.
This program makes no attempts whatsoever to help the translator correct the syntax of the translated text. Perhaps, at a later date, we would want to do this. In such a case, we would want to store grammatical information in each RealWord, expressing whether it is a verb or a noun or an adjective, and so on. It is unclear whether the syntax correction must occur simultaneously with the dictionary look-ups. Perhaps the syntax can be fixed after the definitions have been found? In any case, this is an incredibly large undertaking, and I will certainly not be doing anything on it soon.
As mentioned above, the user can choose alternate definitions for each word. The user should similarly be allowed to save these alternate definitions to the database. This seems like a pretty important feature and should be added soon.
In the initial release, the primary audience will be Mitch, who will be translating Spanish to English. However, we will need to extend the usage of this software to other languages. We must make sure that this modularity is built in from the start: that a user can just plug in a LEXer and a database for Romanian and Serbo-Croatian, for example, and translate from Romanian to Serbo-Croatian.
I had some other features in mind, but I don't have my initial notebook here. In any case, they are not centrally important to the core functionality