Lightweight diacritic restoration for V4 languages

Diacritics restoration has become a ubiquitous task in the Latin-alphabet-based, English-dominated language environment of the Internet. We showcase a small-footprint solution based on 1D convolutions that runs in a web browser and surpasses the performance of similarly sized models.

Many languages have alphabets in which some characters are derived from others by adding diacritical marks. The goal of diacritics restoration is to recover these marks from an input text that lacks them entirely or contains them only partially. We focus on the languages of the Visegrad Group (V4), with emphasis on Hungarian.
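To make the task concrete, the following minimal sketch produces the kind of diacritic-free input that a restoration model has to repair. It is only an illustration and not part of our pipeline: the function name `strip_diacritics` is hypothetical, and the approach (Unicode decomposition followed by dropping combining marks) is one common way to simulate such input. Note that some letters, such as the Polish "ł", have no Unicode decomposition and would need an explicit mapping.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing combining marks (category "Mn"), keep the base letters.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


# Hungarian example: the model's job is to map the output back to the input.
print(strip_diacritics("árvíztűrő tükörfúrógép"))  # arvizturo tukorfurogep
```

Restoration is the inverse mapping, and it is ambiguous without context: several different words can collapse to the same stripped form, which is why a model with a wide context window is useful.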

Approaches to diacritics restoration have evolved from rule-based and statistical solutions to machine learning models. We apply a fast, language-independent method with a small footprint, using a neural architecture based on 1D convolutions: the so-called Acausal Temporal Convolutional Network (A-TCN) [1].

A-TCNs are 1D fully convolutional networks built with dilation factors that increase exponentially with the depth of the network. They consist of residual blocks, each of which applies a series of transformations and then adds the result to its input. The transformation consists of a dilated convolution followed by a normalization layer, an activation function, and dropout.
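The effect of exponentially growing dilations can be sketched in a few lines of plain Python. This is an illustrative toy and not our implementation: it assumes a dilation base of 2, a single input channel, one convolution per block, and ReLU as the activation, and it leaves out normalization and dropout. All function names are hypothetical.

```python
def dilations(num_blocks: int, base: int = 2) -> list:
    """Dilation factor of each block: 1, 2, 4, ... (base 2 is an assumption)."""
    return [base ** i for i in range(num_blocks)]


def receptive_field(kernel_size: int, num_blocks: int) -> int:
    """Number of input positions that influence one output position."""
    # Each dilated convolution widens the window by (kernel_size - 1) * dilation.
    return 1 + sum((kernel_size - 1) * d for d in dilations(num_blocks))


def dilated_conv1d(x, weights, dilation):
    """Acausal dilated convolution with zero padding; output length == input length."""
    half = len(weights) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t + (k - half) * dilation  # looks both left and right of t
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x, weights, dilation):
    """Toy residual block: conv -> ReLU, then add the input back."""
    h = dilated_conv1d(x, weights, dilation)
    h = [max(0.0, v) for v in h]  # normalization and dropout omitted here
    return [a + b for a, b in zip(x, h)]


print(dilations(4))           # [1, 2, 4, 8]
print(receptive_field(3, 4))  # 31
```

With a kernel size of 3 and four blocks, each output character already sees 15 characters on either side, which illustrates how a shallow network can cover a wide context. Because the convolution is acausal, the window is centred on the current position rather than looking only at the past.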

Our model can be converted to ONNX (Open Neural Network Exchange [link: https://github.com/onnx/onnx ]), a cross-platform neural network format. ONNX.js [link: https://github.com/microsoft/onnxjs ] is a JavaScript library that runs models in ONNX format, which makes it possible to run our model in the browser.

Converting a model to work with ONNX.js requires some care. For example, LSTMs were not yet supported, and even 1D convolutions had to be simulated with 2D convolutions. Another difficulty is that the model itself accepts inputs of arbitrary length, but in ONNX.js the first inference fixes the input sequence length. Our solution is to reload the model dynamically: if the input is longer than the current limit, the model is reloaded with double the length.
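The reloading strategy can be sketched as follows. The sketch is written in Python to show the control flow only; it does not use the ONNX.js API, and the names `ReloadingModel`, `load_model` and `fake_loader` are hypothetical stand-ins. It also assumes that shorter inputs are padded with spaces up to the fixed length, and that the length may be doubled several times in one step so that a very long input triggers a single reload.

```python
from typing import Callable


class ReloadingModel:
    """Wraps a fixed-length model and reloads it with a doubled length on demand."""

    def __init__(self, load_model: Callable[[int], Callable[[str], str]],
                 initial_length: int = 64):
        self._load_model = load_model  # hypothetical: builds a model for one length
        self.length = initial_length
        self.reloads = 0
        self._model = load_model(initial_length)

    def restore(self, text: str) -> str:
        # Double the limit until the input fits, then reload at most once.
        new_length = self.length
        while len(text) > new_length:
            new_length *= 2
        if new_length != self.length:
            self.length = new_length
            self._model = self._load_model(new_length)
            self.reloads += 1
        padded = text.ljust(self.length)  # padding with spaces is an assumption
        return self._model(padded)[: len(text)]


def fake_loader(length: int) -> Callable[[str], str]:
    """Stand-in for loading a real model: accepts exactly `length` characters."""
    def run(padded: str) -> str:
        assert len(padded) == length
        return padded  # identity; a real model would restore diacritics
    return run


model = ReloadingModel(fake_loader, initial_length=4)
model.restore("abc")         # fits, no reload
model.restore("abcdefghij")  # too long, limit grows 4 -> 8 -> 16
print(model.length, model.reloads)  # 16 1
```

Doubling keeps the number of reloads logarithmic in the longest input seen so far, at the cost of some wasted computation on padding.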

Our diacritics restoration demonstration in Czech, Polish, Hungarian and Slovak is available here:

https://web.cs.elte.hu/~csbalint/diacritics/demo.html?lang=en&model_lang=HU

This research is conducted in the AI Research Group at the Institute of Mathematics, Eötvös Loránd University.

For additional information please contact: csbalint@protonmail.ch

References

[1] Alqahtani, S., Mishra, A., Diab, M.: Efficient convolutional neural networks for diacritic restoration. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 1442–1448 (2019)
