Artificial intelligence can rediscover illegible letters and words in ancient Hebrew and Aramaic inscriptions
Students at Ben-Gurion University of the Negev applied an extended masked language modeling approach to corrupted inscriptions in Hebrew and Aramaic
Every year, more ancient texts in Hebrew and Aramaic are discovered throughout the Near East. The analysis of these texts is extremely important for researchers studying the culture and history of the region. Because many inscriptions have been damaged over time by earthquakes, fires, political conflicts, and other natural and human-related causes, epigraphists face a major challenge in reconstructing the missing parts of these valuable writings. Until now, they have used time-consuming manual procedures to estimate the missing content.
Students in the Department of Software and Information Systems Engineering at Ben-Gurion University of the Negev have approached this challenge as an extended masked language modeling task, in which the damaged content can comprise single characters, character n-grams (partial words), single complete words, and multi-word n-grams.
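To make the four kinds of damage concrete, here is a minimal Python sketch of how such gaps might be simulated on intact text. It is an illustration only, not the students' code: the `?` placeholder, the helper names `mask_span` and `mask_words`, and the transliterated sample sentence are all assumptions made for this example.

```python
MASK = "?"  # assumed placeholder for one illegible character

def mask_span(text, start, length):
    """Mask `length` character positions starting at `start`, leaving spaces intact."""
    chars = list(text)
    for i in range(start, min(start + length, len(chars))):
        if chars[i] != " ":
            chars[i] = MASK
    return "".join(chars)

def mask_words(text, first_word, n_words):
    """Mask `n_words` complete words starting at word index `first_word`."""
    words = text.split(" ")
    for i in range(first_word, min(first_word + n_words, len(words))):
        words[i] = MASK * len(words[i])
    return " ".join(words)

# Transliterated sample sentence, used here only for readability.
sentence = "bereshit bara elohim et hashamayim veet haaretz"

single_char = mask_span(sentence, 2, 1)   # one missing character
char_ngram = mask_span(sentence, 2, 3)    # a partial word
single_word = mask_words(sentence, 1, 1)  # one complete word
multi_word = mask_words(sentence, 1, 2)   # a multi-word n-gram

for damaged in (single_char, char_ngram, single_word, multi_word):
    print(damaged)
```

A restoration model would then be asked to predict the original characters behind each run of placeholders, which is what makes the task an extension of ordinary single-token masking.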
In their final project under the supervision of Prof. Mark Last, the fourth-year undergraduate students Niv Fono, Harel Moshayof, Eldar Karol, and Itay Asraf applied the masked language modeling approach to corrupted inscriptions in Hebrew and Aramaic. Their model, "Embible," was highlighted at the latest meeting of the European Chapter of the Association for Computational Linguistics last month.
The students trained the system on 22,144 sentences from the Old Testament and tested it on a further 536 held-out sentences, with noted success. An ensemble of word and character prediction models achieved the highest accuracy.
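The announcement does not say how the word and character models were combined. One common way to build such an ensemble is to average the scores each model assigns to a candidate restoration, as in the rough Python sketch below. The function name `ensemble_predict`, the equal weighting, and the candidate scores are all hypothetical and are not taken from the project.

```python
def ensemble_predict(word_scores, char_scores, word_weight=0.5):
    """Combine two models' candidate scores by weighted averaging.

    A candidate missing from one model's output counts as score 0.0 there.
    This weighting scheme is an assumption, not the project's method.
    """
    candidates = set(word_scores) | set(char_scores)
    combined = {
        c: word_weight * word_scores.get(c, 0.0)
        + (1.0 - word_weight) * char_scores.get(c, 0.0)
        for c in candidates
    }
    best = max(combined, key=combined.get)
    return best, combined

# Made-up scores for a single masked word, for illustration only.
word_scores = {"bara": 0.6, "bana": 0.3}
char_scores = {"bara": 0.5, "bana": 0.4, "baza": 0.1}

best, combined = ensemble_predict(word_scores, char_scores)
print(best)
```

Averaging lets a character-level model propose spellings the word-level model never ranked, while the word-level model still favors candidates that fit the sentence.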
"We can help historians who have devoted their lives to recreating these ancient texts as accurately as possible," says Prof. Last. "Furthermore, I believe the model can be extended to cover other morphologically rich ancient languages."