New AI Model Could Help Unlock Secrets Of Damaged Ancient Texts

By Sara Miller, NoCamels April 30, 2024 3 minutes

A damaged fragment of the Dead Sea Scrolls (Shai Halevi/Israel Antiquities Authority)

In 1947, a Bedouin shepherd stumbled upon the first part of a treasure trove of 15,000 ancient Jewish texts, in a cave just a stone’s throw from the shores of the Dead Sea.

The 2,000-year-old texts – which later came to be known worldwide as the Dead Sea Scrolls – are a series of manuscripts written mainly in Hebrew and mainly on parchment, which shed light on contemporaneous Jewish life in the Holy Land.

And while some of the manuscripts are complete texts, there are thousands of fragments whose poor condition means that they cannot be deciphered.

But a new AI system developed at Ben-Gurion University of the Negev (BGU) could be a solution for those unfathomable Dead Sea scrolls and other ancient texts whose poor condition leaves us with more questions than answers.

The new system is the work of four undergraduates from BGU’s Department of Software and Information Systems Engineering, who produced it as part of their final fourth-year project. It employs masked language modeling (MLM) – using context to predict unseen words in a phrase or sentence – to decipher the text in corrupted inscriptions in Hebrew and Aramaic.

Prof. Mark Last: Creating a MLM for ancient Hebrew not an easy task (Courtesy)

The process created by Itay Asraf, Niv Fono, Eldar Karol and Harel Moshayof is similar to large language models (AI platforms that process enormous amounts of written text in order to comprehend and create human language), their supervising professor Mark Last tells NoCamels.

The main difference between the standard masked language modeling and the newly developed platform is the way in which missing text is presented, the professor explains.

In MLM, the kind of text to be examined is selected in advance, be it a word, a phrase or a sentence. But there is not that luxury when trying to decipher fragmented ancient manuscripts.

“In the case of a damaged ancient inscription, the parts that are missing might be different,” Last says. “Sometimes they include one word, sometimes they include a partial word, sometimes they include several words.”

Last explains that he himself came up with the initiative and suggested it to the students, although he concedes it was “not a conventional project” for students.

“They knew from the start that it’s not going to be easy,” he recalls. “It’s not an easy task, but they did the work. They became enthusiastic about it, and it paid off.”

The entire project took about a year to complete and earned all four students a 100 percent grade.

Sign up for our free weekly newsletter

Last says that he was inspired to create the project due to his own familiarity with large language models and from memories of watching his mother, a doctor of ancient history, as she tried to decipher millennia-old inscriptions in Latin and Greek.

First, the four students found large language models and masked language modeling that were compatible with modern Hebrew, which akin to its ancient predecessor but unlike Western languages, is read from right to left and does not use a Roman alphabet.

“Then they started piling in the text so that the algorithm could understand what they were asking,” Last says.

Once the modern Hebrew data had been fed into the models, they used it to create a model based on ancient Hebrew.

The team used 22,144 sentences from the Old Testament, mainly in Hebrew, to train their model (Unsplash)

Last explains that because of a scarcity of Aramaic texts to feed to the LLMs, the emphasis was placed on Hebrew. So the four students used the biblical texts from the Old Testament – most in Hebrew but several in Aramaic too – to train the platform. In all, the team used 22,144 sentences from the Old Testament.

“We worked with biblical text from the Old Testament because in that case, we know the ground truth,” he says.

“So if we randomly mask words or parts of words and try to predict what is missing, we can always check how accurate our prediction was.”

The model, which the team has called Embible, was presented at the latest gathering of the European Chapter of the Association for Computational Linguistics, which took place in Malta last month.

Last believes that Embible will be of use to others like his own mother, who have spent years trying to decipher writings that are thousands of years old.

“We can help historians who have devoted their lives to recreating these ancient texts as accurately as possible,” he said.