Transducing German Cardinal Numbers

Sep 6, 2020

This is a little excercise in writing finite-state transducers. The goal is to convert spelled-out German cardinal numbers from 1 to 9999 to numerals and vice versa. I will construct the transducer by going from written-out numbers to numerals, e.g., from zweiunddreißig to 32. The implementation is done with the finite-state toolkit foma.

Intermediate representation

In a first step, we convert each written-out number into an intermediate representation by transducing compound parts to numerals in linear order, for example, neuntausend-zweihundert-acht-und-vierzig (hyphens added for better readability) should become 9000+200+8&40. Let's start with two-digit numbers.

TwoToNine = zwei:2  | drei:3   | vier:4  | fünf:5  |
            sechs:6 | sieben:7 | acht:8  | neun:9  ;

Teens = zehn:     [1 %0]   | elf:      [1 1]    | 
        zwölf:    [1 2]    | dreizehn: [1 3]    |
        vierzehn: [1 4]    | fünfzehn: [1 5]    |
        sechzehn: [1 6]    | siebzehn: [1 7]    |
        achtzehn: [1 8]    | neunzehn: [1 9]    ;

TwentyToNinety = zwanzig:[2 %0]  | dreißig:[3 %0]  |
                 vierzig:[4 %0]  | fünfzig:[5 %0]  |
                 sechzig:[6 %0]  | siebzig:[7 %0]  |
                 achtzig:[8 %0]  | neunzig:[9 %0]  ;

TwoDigitNumber = eins:1 | TwoToNine | Teens | TwentyToNinety |
                 [ein:1 | TwoToNine] und:%& TwentyToNinety;

The digit 1 is special, since it sometimes appears as ein and sometimes eins, that's why it's not part of the first transducer. Some Teens could be decomposed further, but this would only complicate things. Because 0 has special semantics in foma regex, we need to make sure to escape 0 literals with the percentage symbol. The ampersand that replaces und indicates that the numbers around it need to be swapped later.

For larger numbers, we need hundreds and thousands, so we define two more transducers.

Hundred  = [ein:1 | TwoToNine] hundert:[%0 %0 %+];
Thousand = [ein:1 | TwoToNine] tausend:[%0 %0 %0 %+];

Now we can put everything together. For the sake of simplicity, we ignore some variants: we recognize einhundert and eintausend, but not hundert and tausend; we recognize einhundertfünf, but not einhundertundfünf. These variants can be added easily, but the expressions become a bit messier.

Intermediate = (Thousand) (Hundred) (TwoDigitNumber);

If we check the lower-words1 of this transducer, we see the following:

foma[1]: lower-words

1000+
1000+100+
1000+100+1&20
...

Now we need to transform these intermediate forms into the desired output.

Cascaded transducers

We'll move from right to left. In the end there may be a trailing plus (introduced by Hundred or Thousand). We don't need it, so we add a cleanup rule.

CleanPlus = %+ -> 0 || _ .#.;

Then we remove any trailing zero after an ampersand and swap the digits around the ampersand.

Trunc = %0 -> 0 || %& ? _ .#.;

def Ins(X) [..] -> X || X %& ? _ .#.;
Copy = Ins(1) .o. Ins(2) .o. Ins(3) .o. 
       Ins(4) .o. Ins(5) .o. Ins(6) .o.
       Ins(7) .o. Ins(8) .o. Ins(9) ;

Delete = ? %& -> 0 ;
Swap   = Trunc .o. Copy .o. Delete;

Maybe there is some more elegant syntax to do this, but I define a helper template that inserts a digit in the context of itself, the ampersand and some other digit. By applying this template for each non-zero digit, we copy the digit before the ampersand to the end of the string. Finally we delete the original digit and the ampersand, completing the swap.

Now we can move further to the left and add our two digit number to the hundreds.

N = [%0|1|2|3|4|5|6|7|8|9];

FixHundred = %0 %+     -> 0 || %0 _ N .#. ,,
             %0 %0 %+  -> 0 || _ N N .#.  ;

We simply delete as many zeros as there are digits after the plus symbol. The same is repeated with the thousands to the left.

FixThousand = %0 %+       -> 0  || %0 %0 _ N .#. ,,
              %0 %0 %+    -> 0  || %0 _ N N .#.  ,,
              %0 %0 %0 %+ -> 0  || _ N N N .#.    ;

We compose the pieces to obtain our transducer for cardinal numbers.

Card = Intermediate .o. CleanPlus .o. Swap .o. 
       FixHundred .o. FixThousand;

Result

The final transducer has 10,000 paths: 9,999 numbers and the empty string. FSTs are closed under inversion (input and output labels are simply switched), so we can apply our cardinal number FST in both directions, as generator and parser.

foma[0]: regex Card;
4.8 kB. 37 states, 212 arcs, 10000 paths.

foma[1]: up
apply up> 2345
zweitausenddreihundertfünfundvierzig

foma[1]: down
apply down> fünftausenddreiundzwanzig
5023

1 Words of the lower or second projection of Intermediate, that is, words o for which an input i exists such that Intermediate maps i to o.