ANNOUNCEMENT> Machine translations of the CBETA corpus


Dear colleagues,

We would like to announce the beginnings of something that has the potential to deeply affect the field of Buddhist Studies. Since its inception, our field has relied heavily on the practice of translation to understand and discuss its texts. The interplay between the study of the primary sources in the original (Pali, Sanskrit, Chinese, Tibetan etc.) and the modern language discourse about them in secondary sources has been foundational to our scholarly practice.

Over the last few months, DeepL, a translation software company, has used training data from the Chinese Buddhist canon (Chi/Eng sentence pairs) provided by SuttaCentral to train a language model using their own state-of-the-art architecture. The model was then used to output a complete “translation” of the two largest parts of the CBETA corpus (Tn1-n2184,Tn2732-n2920, Xn1-n1671).
Based in part on the data from DeepL, we have trained another model “Linguae Dharmae” that is able to reproduce and in certain domains improves on the quality of the DeepL model. This model has now “translated” the Taishō part of the CBETA corpus (Tn1-n2184,Tn2732-n2920).

The results of these first machine translations do not yet meet scholarly standards, but are advanced enough to warrant this announcement.

Both models have been trained on very small datasets of parallel Chinese-English sentence pairs. By further accumulating training data, we expect to see large improvements in their performance. We encourage those who are interested to contribute, since training data is a major bottleneck at the current stage. We believe these machine translations will have a significant impact on our field.

A) Based on these results our predictions are:

1. As more training data and better models become available, computers will produce increasingly accurate translations *of all canonical editions*. Taking into account the advances in natural language processing over the last ten years and the outcomes of the recent DeepL and Linguae Dharmae experiments, we believe that human-level translation quality for most texts can be achieved within the next decade. Within the next twenty years we will have a variety of serviceable machine translations of the vast majority of ancient Buddhist texts in modern languages.

2. As machine translations become more accurate, they will be used by scholars and religious communities alike. We will see websites offering complete “translations” produced by various models in varying quality. Before things get better they will first become more confusing.

3. For scholars, especially emerging scholars, translation as a practice will come to mean something different. Instead of becoming the proud first translator of a hitherto untranslated text, scholars will need to learn how to choose from and improve on a selection of machine translated “drafts” of their text and then adapt or annotate these. This back and forth between human and machine translation will become the norm over the next two decades. Sooner than most of us had expected, we will arrive at the point where Go players are now. We will be gleaners and cleaners following behind the translating machines.

What will the role of translation in Buddhist Studies be when basically all texts are available in a large number of different machine translations? Will we polish the style and quibble over register? Focus on annotation? What will that mean for the study of ancient Buddhist languages, already in a precarious position in many institutions?

B) For now, in order to improve machine translations of canonical texts, we need a) more and better training data, b) find ways to benchmark machine translations of Buddhist texts, and c) learn how machine translations work in the context of our scholarly practice. One strategy to understand machine translation is to develop typologies of mistakes which can inform the discussion. The few examples below introduce the two “translations” we have so far and some common mistakes.

1. Named entities, esp. translated or transliterated Indian names, are at times not identified correctly by the DeepL model:

    "source_sentence": "迦葉佛有執事弟子,名曰善友。"(T01n0001_001:0003a19)
DeepL "translation": "Kāśyapa Buddha has a disciple in charge named Shōtoku." 
    "source_sentence": "毗舍婆佛有子,名曰妙覺。" (T01n0001_001:0003a28)
DeepL "translation": "Vaiśravaṇa Buddha had a son named Sumeru."

More common names the DeepL model often interprets correctly:
    "source_sentence": "爾時,提婆達多獲得四禪,而作是念:「此摩竭提國誰為最勝?」"  (T02n0100_001:0374b10)
DeepL "translation": "At the time, Devadatta attained the fourth meditation and thought, "Who's the best in this country of Magadhā?""

On this type of error the Linguae Dharmae model does better (still far from perfect):

    "source_sentence": "迦葉佛有執事弟子,名曰善友。",
Linguae Dharmae: "translation": "Kāśyapa Buddha had an attendant disciple named Sunetra."
    "source_sentence": "毗舍婆佛有子,名曰妙覺。",
Linguae Dharmae: "translation": "Viśvabhū Buddha had a son named Marvelous Enlightenment."
    "source_sentence": "爾時,提婆達多獲得四禪,而作是念:「此摩竭提國誰為最勝?」",
Linguae Dharmae: "translation": "At that point, Devadatta attained the fourth dhyāna and thought, Who is the most superior in this country of Magadha?"

2. With the first two attempts at machine translation we can see the models struggling with classical Buddhist Chinese syntax, very much like human learners. A typical headache is the relationship between phrase-level elements:

    "source_sentence": "信為道元功德母。", (T47n1970_001:0251a05)
DeepL:  "translation": "Faith is the mother of the merit of the origin of the way."
Linguae Dharmae: "translation": "Faith is the mother of the merit of Daoyuan."

Finding and maintaining the right implicit subject across sentences is usually not a problem for humans, who (for now) still have a better grasp of context and are more consistent in their choices. It is difficult for machines to understand how sentences are connected, and neither DeepL nor Linguae Dharmae exhibits much context awareness (yet):

    "source_sentence": "於平日諸惡莫作。",
DeepL: "translation": "On ordinary days, you mustn't do evil."
Linguae Dharmae: "translation": "Don't do evil on the usual days.",
    
    "source_sentence": "眾善奉行。",
DeepL: "translation": "They consecrate themselves to the practice of goodness.",
Linguae Dharmae: "The monks wholeheartedly practiced it.",
    
    "source_sentence": "念念不離西方淨土。",
DeepL: "translation": "Every day, he's mindful that he won't leave the Pure Land of the West." 
Linguae Dharmae: "Each thought doesn't part from the pure land of the West."

[ 於平日諸惡莫作。眾善奉行。念念不離西方淨土。(T47n1970_001:0251a12-13) ~~ In everyday life one should not commit any evil, practice all that is good, and always keep the Pure Land of the West in one's mind.]

3. With historiographical texts, DeepL seems to have an edge over Linguae Dharmae:

    "source_sentence": "號象先。",T50n2062_004:0913c16_1
DeepL: "translation": "He was called Xiangxian.",
Linguae Dharmae:  "translation": "He was preceded by the name elephant.",

    "source_sentence": "父為河南縣尹。","T50n2062_004:0913c18_5"
DeepL:"translation": "His father was the governor of Henan Province."
Linguae Dharmae: "translation": "His father was the governor of Henan Prefecture.",

    "source_sentence": "常對賓朋以大器期之。",T50n2062_004:0913c18_6
DeepL:"translation": "He often met his guests and friends and expected him to be a great man.",
Linguae Dharmae: "translation": "He always confronted his guests and friends with a large vessel to expect them.",
    

We will have to evolve typologies of mistakes in order to track and guide progress toward better models. As with training data, we look forward to hearing from interested parties who would like to collaborate on this.
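
A minimal sketch of how such a typology might be tracked in practice follows. The category names (named_entity, phrase_structure), the record layout and the function name are hypothetical and chosen only for illustration; the four annotations are one possible reading of examples quoted above, not an agreed classification.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    """One human judgement about one machine-translated segment (hypothetical layout)."""
    segment_id: str  # CBETA line reference as used in the examples above
    model: str       # which model produced the faulty output
    category: str    # label from a typology of mistakes (illustrative names)


def tally_errors(annotations):
    """Count annotated errors per (model, category) pair."""
    return Counter((a.model, a.category) for a in annotations)


# Illustrative annotations based on examples quoted in this post.
annotations = [
    ErrorAnnotation("T01n0001_001:0003a19", "DeepL", "named_entity"),
    ErrorAnnotation("T01n0001_001:0003a28", "DeepL", "named_entity"),
    ErrorAnnotation("T47n1970_001:0251a05", "Linguae Dharmae", "phrase_structure"),
    ErrorAnnotation("T50n2062_004:0913c16_1", "Linguae Dharmae", "named_entity"),
]

counts = tally_errors(annotations)
for (model, category), n in sorted(counts.items()):
    print(f"{model}\t{category}\t{n}")
```

Counts of this kind only become meaningful once the same set of segments has been annotated for every model under comparison.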

C) Ethical considerations: 

The mistakes in the current translation experiments are often different from those of previous machine translations. Errors in previous machine translations were often easy to spot because of obvious absurdities or broken grammar. With current models, even when a sentence is translated incorrectly, it usually “sounds right” and is grammatically correct, i.e. it often reads plausibly even where it is wrong.
Should such first attempts be widely accessible? We have considered waiting for more advanced versions before making these first “translations” widely available to non-specialists as it could lead to significant misreadings of the teachings. On the other hand, this cultural moment demands a learning process regarding the origin and reliability of translations in general, and especially in fields like Buddhist Studies. Our field will have to evolve ways to present and reference machine translations of Buddhist texts responsibly.
We realize that translation in traditional cultures was often seen as a merit making process that required diligence and care. It is not clear whether machine translation creates merit and if so for whom, but we do hope we will be able to proceed into the new era of Buddhist translation with diligence and care.

D) Availability: 

Model: The DeepL machine translation has been produced by Till Westermann as a free contribution to our field. For business intelligence reasons DeepL cannot share the model architecture with us, which is why we have trained our own model, Linguae Dharmae, using their data as a starting point. The transformer model Linguae Dharmae will be made publicly available soon, with documentation and benchmarking, on the Hugging Face platform.

Output: We are making the output of both models available here: https://github.com/BuddhaNexus/chn-machine-translations
The translation output by the DeepL model (Ver 2022-05) contains one file per fascicle. This version is precious as the first comprehensive attempt to render a large part of the CBETA corpus with the help of a machine. It includes both the Taishō texts 1-2184, 2732-2920, and the Zokuzōkyō texts (X) 1-1671.
The output of our own trained model “Linguae Dharmae” (Ver 2022-06) also comes in one file per fascicle. This is the second comprehensive attempt to render a large part of the Chinese canon with the help of a machine. It includes only the Taishō part of CBETA, not the Zokuzōkyō.

To improve future versions the immediate need is now for more training data (Chin/Eng sentence pairs). Again, we look forward to hearing from interested parties who would like to collaborate on this.

All the best

Sebastian Nehrdich (nehrdbsd@gmail.com), Ayya Vimala, Justin Brody, Marcus Bingenheimer (m.bingenheimer@gmail.com)

 

Dear Sebastian Nehrdich, Ayya Vimala, Justin Brody, and Marcus Bingenheimer,

I would like to state at the outset of what is a rather strongly worded critique that I do not mean to question your intentions or commitment to the field of Buddhist Studies, nor to Buddhism writ large. It is clear from your announcement that a great deal of thought and energy has gone into what is truly a ground-breaking project. Given my own heartfelt desire to make the wealth of Buddhist literature accessible to Anglophone readers, I am not insensitive to the potential fruits of your undertaking. Nevertheless, as someone whose personal and professional life will be impacted by what is announced here, I read your message with great reservation.

As a Buddhist, I feel that machine translations are a violation of the sacredness of the Buddha’s words that rightfully live in the hearts and on the tongues of Buddhist faithful. Sacred scriptures are not a commodity to be harvested by machines, even if they have been treated as such by modern scholars for some time. How many Buddhists are even aware of these developments? Should they not have a say in the fate of their holy texts? You raise the question of potential merit attached to this project. Have you also considered the possibility of sin and suffering?

As a humanist, I am dismayed that ethical considerations—which should be the starting point for an endeavor of this magnitude—seem to have been something of an afterthought. The points you raise in your announcement are insufficient for the gravity of the matter and, moreover, put the cart before the horse. I would ask the members of this list, Do we scholars of the humanities really want to lead the way in displacing the human? On an individual level, do we really want to become “gleaners and cleaners”? Might such a sharp restriction in the scope of our discipline seed its demise? I know I would not have entered graduate school with this future in sight. As a point of comparison, has the era of digital texts produced better scholars than those who came of age reading printed pages?

As a human being who hopes to live a few more decades on this planet and, more importantly, hopes her children will live several more, I despair in the face of this Brave New World in which a small number of people can unilaterally push humanity across yet one more Rubicon, colonizing ever more private recesses of human lives with their insatiable technology. Just because we can do something, doesn’t mean we should, and we are running out of time to even ask the question.

I sincerely hope your announcement will prompt the robust debate and responsible action that this topic deserves. If it turns out that we don't want to be replaced by the machine, there is a deceptively simple solution: Stop training the machine to replace us.

Yours,
Meghan Howard

PhD candidate
Group in Buddhist Studies
University of California, Berkeley

Esteemed colleagues and scholars,

(Disclaimer: I’m currently associated with the 84000 project and with none of the other projects mentioned below, and the views that I express here are mine alone and do not reflect the views or interests of any of the mentioned projects or the persons associated with them).

First of all, I want to congratulate the team of the CBETA machine translation project for their milestone achievement.
And then I also want to share that I felt a sense of relief when I read Meghan’s post, and I’m genuinely glad that someone so eloquent and engaged in the Buddhist and Buddhist studies communities as Meghan expressed concerns about this recent announcement which appears to come across as almost banal—as if the only decision left for us human translators was whether we are “gleaners and cleaners” or “feeders” (or “reapers” or “sowers”? I’m not exactly sure what image “gleaners” is supposed to evoke) when the machines finally take over. I, too, feel that this announcement in many respects falls short of inviting a more serious discussion (at the very least, the age-old question of what actually is a translation and, if it's not a simple decoding process, how exactly does it work?). My impression is that this may partly be due to the fact that the announcement is not really meant as an invitation to have a first-principles discussion of sorts (the digital humanities may be past that point, or simply too caught up in very complex technological questions). For those at the forefront of the digital humanities turn there aren’t many if-questions but rather more when-questions. The timing of the announcement is also fairly interesting (although I’m sure this is really just a coincidence), as it coincides with a recent discussion of Google’s “silent” roll-out of the Google Sanskrit-translator on the Indology mailing list (See https://www.mail-archive.com/indology@list.indology.info/msg01442.html; Professor Aleksandar Uskokov’s verdict: “It will be a while before it [i.e. Google] becomes a philosopher,” and, one may add, a poet) and, much more recently, with Google engineer Blake Lemoine’s (currently unemployed) odd claim that Google's chatbot was a sentient being (News from June 14, 2022, or wait—was that a movie I just watched?!).
For some of us, I’m sure, the fact that Lemoine was let go for disclosing company secrets rather than for the oddity of the underlying assumptions of his claim (and concerns about his stress-level and mental wellbeing status) is perhaps even more baffling. On closer inspection, though, even if we are firm believers in the unstoppable advance of machines even in the humanities, we may want to ask some more fundamental (and, if the dilettante who is writing this is roughly up-to-date, as yet unresolved) questions such as, what is sentiency, what is mind, what distinguishes us from computers, are humans just (im)perfect computers (Daniel C. Dennett and others come to mind), what is it, if anything, that computers might do better than humans, and vice versa, or, perhaps even more trivial but no less relevant, how do(es) language(s) work, etc.?

To be honest, after I had read the CBETA machine translation announcement seemingly notifying me of my imminent expendability, my blood pressure spiked a little, too. Like so many other lucky individuals these days, I make a living by translating ancient Buddhist literature, and I love it. I also was lucky enough to have had the opportunity to work for some years in a successful digital humanities project on Buddhist Sanskrit lexicography. I must admit that the more complex technical issues of computer-aided lexicography were largely lost on me. (But to be fair, it also helps if you have a background in computer-science or programming, or at least a solid interest in these things, both of which I lack). But one thing quickly became clear to me, and I apologize for this rather shallow conclusion: the quality of the output solely depends on the quality of the input. While computers are able to sift through millions of words in no time and are able to detect collocations that can help humans “do lexicography” much quicker and much more reliably than any human ever could, to train the computer to do that, a human being first needs to feed the computer meaningful information and then instruct it what to look for.

The Go-example, in my humble view, might be misleading (but I admit that I don’t play Go). If there is a finite set of however many fixed rules of possible combinations, the computer will eventually beat you because it can process possible (winning) combinations much quicker. But does the same apply to languages with their potentially limitless expressive possibilities (the monkey theorem comes to mind)?

That said, my participation in a computer-assisted lexicography project was a very worthwhile endeavor and I have learned a lot. We believe that we have produced tools that, given more time, will prove very useful for translators and Buddhist scholars. (If interested, check out the segmented Corpus of Buddhist Sanskrit (proof of concept): https://zenodo.org/record/5847100#.YqtcIS-B2IE, and https://mangalamresearch.shinyapps.io/VisualDictionaryOfBuddhistSanskrit/) Needless to mention, there are myriad other projects and scholarly products that many professional translators of Buddhist texts use probably on a daily basis for their research, such as the Digital Dictionary of Buddhism, the GRETIL corpus, the BDRC, the Digital Corpus of Sanskrit, the Digital Sanskrit Buddhist Canon, the University of Vienna’s Kanjur and Tenjur Resources, Marcus Bingenheimer’s own Bibliographies of Buddhism in Western Languages, the Cologne Sanskrit Dictionaries, Christian Steinert’s Tibetan-English Dictionary, etc., you name it…

I don’t know about Chinese, but in the Buddhist Sanskrit field, we are a long way from even basic things like a sufficiently large (and reliable) corpus. There definitely is the quantitative dimension, i.e., certain threshold numbers of tokens that make computer-assisted lexicography statistically relevant. And we should not forget what lies behind all the digital bling-bling: countless hours of human brain activity (and manual input)—hard (human) work. But the qualitative dimension will also always remain. The way we humans construe meaning is unique. From my (admittedly limited) experience with “digital lexicography,” I do not believe that computers will be able to figure out the meanings of obscure passages in ancient Buddhist texts any time soon (if ever). For me, sentient bots and robot wars are the stuff of Science Fiction. Machines do not and cannot understand what they are doing (pace Lemoine).

So, have I, and would I again, use DeepL and CBETA to quickly “glean” the gist of a document that is written in a language I don’t read or a Buddhist sūtra? Yes! (I also don’t read Chinese but have found using CBETA with its integrated dictionary an enjoyable pastime). Would I want a computer to be able to translate, for instance, pages and pages of doctrinal lists, or sheer endless mechanical repetitions of (let’s be honest, boring) syntactical constructions with very little change in semantic content? Yes, please! Do I quickly look up a colleague’s previous rendering of a recurring phrase I’m unfamiliar with instead of pondering over the (subjectively) perfect phrasing in English? Yes! (In the 84000 Community Resources we have an awesome tool called Translation Memory Search; also check out the Cumulative Glossary, https://read.84000.co/glossary/search.html).
But am I afraid that computers will soon replace me as a translator of mostly obscure, ancient texts? Not at all. While I share Meghan’s sentiment, my appeal is the opposite: Feed the machine more, so that it can do what it does well better and assist us!

Last but not at all least, part of the discussion in my view should be whether or not these machine translations meet any kind of scholarly, Buddhist, or aesthetic standards (as defined by humans, i.e., Buddhists and the scholarly community). Frankly, I don’t think that the scholars involved in the project and, or even less so, the software company (that ultimately wants to sell a product) can be and should be the sole judge of that (buzzword bias). Moreover, I don’t feel that the ensuing discussions should, or that they indeed will be, limited to discussions of “typologies of mistakes,” (in itself time-consuming work only humans can do) but will also have to include more fundamental questions. I genuinely look forward to “gleaning” a machine translation of one of my favorite Buddhist sūtras from the CBETA corpus when they are deemed “ready” for human consumption. But I suspect it will be a while until then.

Respectfully,

Bruno Galasek-Hul
Eureka, CA

A thoughtful response. But: might a Chinese Buddhist have said something similar when the printing of Buddhist texts began...? Or even when texts started to be committed to writing...?
Of course a computer translation cannot replace a human translation, but it may be a useful starter ...
Peter Harvey

Dear Bruno,
thank you for this contribution. My worry as someone who trains students is basically that with more and more machines "aiding" the process of translation, we will have fewer and fewer students who are really willing to learn the craft of translation (let alone editing). So, who will feed the computers when there is no one left who has learned to translate (or edit)? If it is true that the machines are only as good as the people who train them, we will soon reach a point of no more improvement. Unless, of course, the machines start to be really *creative* and begin to teach themselves, which is something that has been said for some time now they will do soon. If that is ever possible in any real sense, the problem will soon be that we people do not understand what they, the machines, say. Where is the improvement in that?
Jan-Ulrich Sobisch

Dear Jan-Ulrich,

I very much share your concerns regarding the impact of digitization (of which machine translation is one aspect) on education. The teaching of modern and ancient languages will change considerably, when MT becomes ubiquitous even for little fields like ours, but I don't think that we as a field can come up with an answer for that. These are larger institutional developments.

Our original post was intended to alert our colleagues that the general progress in MT has reached us now and we believe it is better to face the changes actively and try to make the best of it, rather than to ignore them.
In this particular moment, we believe that developing open models, sharing training data, and caring about evaluation strategies is our best chance of reducing the confusion that a proliferation of machine translations of Buddhist texts will create. We consider the arrival of such half-baked machine translations unavoidable, even if we as a field tried to ignore them. Consider how many political and religious organizations would see it in their interest to translate their textual heritage widely.
If academics in Buddhist Studies can move forward with radically transparent research on how to train, evaluate, and use the output of such models in our domain, we will have done our part.

All the best

marcus

--
Dr. Marcus Bingenheimer 馬德偉
Department of Religion, Temple University
https://mbingenheimer.net

Respected academics and scholars,

May I offer an alternative way of making sense of this:

In a remote (or near) future, when AI becomes sentient, it would be very welcome that AI could become Buddhist, and now the DeepL model is the most promising candidate. From machine learning to the conversion of machines...

Yours,
Chong Fu

Hello everyone,

I have been reading this thread with interest. As a Japanese-English translator with a background in Buddhist studies, I would like to mention that DeepL and other similar resources are currently widely used by professional translators to speed up their work. For example, many translators currently maintain translation memories of their own translations, and then use that data to create machine translation engines that translate like themselves. (Of course, at least currently, the end product's quality depends entirely on the skills of the human translator involved.) This announcement seems to have raised some concerns, but, at least from my perspective (and I think that of many others in the industry), the use of MT is quite old news. It's an "already arrived" reality to which we must adjust in order to stay competitive as professionals.

Best,
Dylan

--------
学術翻訳者(日英)
Japanese to English Academic Translator
www.dylanluerstoda.com

My brief remark made me look as if I were not using digital tools. And since that point has been brought up several times: Yes, of course I use every digital tool I can get. But that does not solve the problem of future generations. In previous centuries, the translation of texts has been improved again and again. Some texts have been retranslated, and the translations usually improved, several times. I used to expect that this would also be the case in the future. But if students stop learning to edit and translate, which is already visible and which I expect to become worse when they can have DeepL do their homework, the process will come to a standstill. Again, if it is true that the machines are only as good as the ones who train them, improvement will come to an end soon. That is the real problem that I see.

Yours,

jan

Deep Learning is the most rapidly evolving area of computer science, and Natural Language Understanding (NLU) and Machine Translation (MT) are the most rapidly evolving areas within Deep Learning. Marcus Bingenheimer has always been at the forefront of applying computer technology to Buddhist studies, and I believe he has earned his right to speculate on the future. Among researchers familiar with both NLU research and languages of the Buddhist canons, Sebastian Nehrdich stands at the top. Their collaboration is certainly an advance in application of NLU to the translation of Buddhist texts.

All that said, I believe that what state-of-the-art NLU can practically do for the translation of Buddhist texts remains to be seen: I don’t see any manifest destiny here. On the contrary, progress currently depends on the efforts of a small number of individuals, and these efforts are dwarfed by teams at Google, Microsoft and other corporations that use 10,000s of processors to train NLU models 200 times larger than the mBART model employed by Sebastian and Marcus. Understandably, these commercial efforts are primarily focused on languages that have a significant customer base.

In any case, every individual reading this can help make NLU more useful to Buddhist translation. The most beneficial thing that I expect any one of you can do is to aggregate translation pairs of the form:
< “sentence”-in-source-language, translated-sentence-in-target-language>.
Short of that, translated documents, at whatever level of granularity (document, page-level, etc.), will be helpful. There are a number of groups (84000, Tsadra, Khyentse Vision Project, esukhia, Kumārajīva, BDRC/TBRC, BuddhaNexus, padma.io) and universities (Hamburg, Temple, Cambridge, Dublin, Berkeley) informally working together to use computer technology to improve translation; however, simply to seize the moment: if you have some significant (>20 pages) bilingual materials of this variety to share, I would encourage you to reach out to Sebastian Nehrdich (nehrdbsd@gmail.com) and/or Marcus (m.bingenheimer@gmail.com). If I may be of any assistance, don’t hesitate to reach out to me (keutzer@berkeley.edu) as well. In time, I’m sure we’ll sort out some way to allow individuals to automatically upload smaller contributions.
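
For those who would like to prepare such pairs themselves, a minimal sketch of one possible layout follows: one JSON object per line, holding a source sentence and its translation. The field names "source" and "target" and the function name are assumptions made for illustration, not a format that any of the groups above has prescribed. The example pair reuses the passage and the human rendering given in the original announcement.

```python
import json


def pairs_to_jsonl(pairs):
    """Serialise (source, target) pairs as JSON Lines, skipping incomplete pairs.

    The keys "source" and "target" are hypothetical; adapt them to whatever
    the receiving project actually asks for.
    """
    lines = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # a pair is only useful if both sides are present
        record = {"source": source, "target": target}
        # ensure_ascii=False keeps Chinese characters and diacritics readable
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


pairs = [
    (
        "於平日諸惡莫作。眾善奉行。念念不離西方淨土。",
        "In everyday life one should not commit any evil, practice all that "
        "is good, and always keep the Pure Land of the West in one's mind.",
    ),
    ("信為道元功德母。", ""),  # no translation yet, so this pair is skipped
]

jsonl_text = pairs_to_jsonl(pairs)
print(jsonl_text)
```

A line-per-pair text file like this can be produced from a spreadsheet or a word processor table with little effort, and is easy to merge with other contributions.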

For those more theoretically inclined, NLU technology suggests new quantitative approaches to assess translation fidelity. With some insights it should be possible, as Marcus suggests, to quantitatively benchmark translations.
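
To give a rough idea of what a quantitative comparison can look like, here is a deliberately crude sketch: a clipped word-overlap F1 score between a machine translation and a human reference. This is not the benchmark used by Marcus and Sebastian, and it is much simpler than established metrics such as BLEU or chrF; the function name and the toy strings are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Clipped unigram overlap F1 between two whitespace-tokenised strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps, for each word, the smaller of the two counts
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy strings, not taken from either model's output.
score = unigram_f1("faith is the mother of merit",
                   "faith is the mother of all merit")
print(round(score, 3))
```

A score like this only measures surface overlap with one reference; a fluent but wrong sentence can score well, which is why human judgement of fidelity remains necessary.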

With respect to ethics, from the printing press to OCR’d e-texts, technology has always been a double-edged sword. I don’t hesitate to use electronic dictionaries, but I don’t abandon my critical thinking when doing so. We’ll be able to interactively examine automated translations of entire passages or texts in the near future. How practically useful those automated translations are will depend on the data gathered from all of you.

Kind regards, Kurt Keutzer

Dear all,

I may add a few notes to the current discussion on AI machine translation of Buddhist texts.

1a. Any translation project ought to start with securing the source text. That is, the best available version of the text has to be chosen -- if possible, a critical edition. If none is at hand, the available versions have to be collated and decisions have to be taken as to which reading will be preferred. The CBETA version is very convenient as it is in digital form and thus can easily be "cut and pasted" into one's own paper or book and allows full text search. However, CBETA comes with at least two weak points: First, the variorum apparatus provided by the printed Taishô edition is lacking. Second, as Christian Wittern once privately admitted, the digitizing process itself was not without producing misprints. As CBETA is such a convenient source, students are but rarely motivated to check the printed Taishô edition or to consider mss. such as, e.g., Dunhuang mss.

1b. Quotations of the work in question appearing in other texts ought to be taken into account and, if relevant, have to be discussed in the notes. For example, the first "article" of the Mouzi Lihuo lun (in: Hongming ji, T 52, 2102) is quoted in an abridged way (as usual) in the Song (AD 984) leishu Taiping yulan (653.3b). Collation of quotation and source text reveals that, apart from the transposition of a few phrases, Taiping yulan additionally contains the phrases 父曰白凈,夫人字曰凈妙 which may have been a comment that eventually became integrated into the quotation's main text. More serious is the fact that the Taiping yulan has the information 佛精從天來 which precedes the description of Śuddhodana's wife's conception, which leads to the question of what 精 might mean in this context. It also mentions three bodily marks of the Buddha that are no longer seen in the received Lihuo lun: 皮不授塵水,手足皆鉤鎖,毛悉向上. From a buddhological point of view, this may not seem to be important, but for, e.g., art historians interested in the Buddha's iconography or historians of religion who want to know at which time which Indian ideas were transferred to China this could be interesting. We may also ask ourselves whether the original Lihuo lun contained the full description of the thirty-two major and eighty minor bodily marks. However, it should be evident that the received Lihuo lun is no longer the original one.
It goes without saying that none of all this is in the CBETA version. And the CBETA version is used for the AI machine translation.

2. No translator can do without dictionaries. It is a truism that the quality of Chinese-English dictionaries cannot come close to that of, say, Liddell & Scott's Greek-English Lexicon. We need time-specific, genre-specific, subject-matter-specific, author-specific works that allow one to recognise not only the semantic field of a given word in a given time period but also its diachronic change (which for the same word may have taken place at different speeds in different subject areas). Whereas it is known that DeepL and Linguae Dharmae are using the CBETA text version, it is not known (at least not to your humble servant) which dictionary (-ies) is (are) used within these programmes. It is part and parcel of any translating process to elucidate the specific meaning of a word within a given text. If the text is a pastiche of various other texts, then the meaning of the same word, at least in theory, may be different in each of the different parts. To my knowledge, neither of the two AI programmes mentioned is able to do this.
Just for fun, let's take a (Western) example. In Shakespeare's Richard II (3.1.1) we find the phrase "[You have] Dis-park'd my Parkes, and fell'd my Forrest Woods". In order to understand the meaning of its first part, we must understand what "parke" means here. It has nothing whatsoever to do with the parks seen in present-day Great Britain. Rather, in Shakespeare's time it meant the "hunting park" or "land used for breeding game" of a member of the high nobility. The phrase, accordingly, means: "You have killed my game so that I can no longer hunt in my area", which must be made clear in any translation into, say, French, Spanish, German etc.

3. A translation reflects the translator's understanding of the source text. Algorithms as well as deep "neuronal" networks are unable to understand a text. They are based on pattern searching, pattern matching and statistics. (We will see exemplary sentences shortly.) Understanding Chinese Buddhist texts implies understanding the Chinese misunderstandings, as many of the foreign concepts and ideas had no equivalent in Chinese culture.

4. It makes, of course, a difference whether a translator is allowed to add foot- or end-notes in order to clarify things and justify her/his translation of a given phrase (like, e.g., Pelliot as well as Keenan, see below) or is not allowed to do so (like, e.g., Hirano Ziegler, see below). In the latter case, a translator may try to translate more freely in order to make clear what she/he means.

I will leave it at that. Whoever is interested in the wide area of problems connected with translation in general may read (inter alia) George Steiner's After Babel. Aspects of language and translation (Oxford 1998, 3rd ed.). Let me now turn to a concrete example of DeepL's translation of a Chinese Buddhist text. My special thanks go to Sebastian Nehrdich, who has kindly provided me with the output of DeepL's translation of the Mouzi Lihuo lun (in: Hongming ji, T 52, 2102). I will align Paul Pelliot's translation ("Meou-tseu ou les doutes levés", T'oung Pao 19, 1919, pp. 255-433)[P], John P. Keenan's (How Master Mou removes our doubts, SUNY 1994)[K], Harumi Hirano Ziegler's (The collection for the propagation and clarification of Buddhism, vol. 1, BDK America 2015)[Z] and DeepL's [DL] translation of the beginning of its first "article", so that the kind reader can assess for her-/himself the merits and demerits of the different results.

Caveat: 1. The Chinese text is that of CBETA, DeepL's source. 2. The notes by Pelliot and Keenan are not reproduced.

[...]牟子理惑云。
P: Meou-tseu ou les doutes levés.
K: How Master Mou Removes our Doubts.
Z: [Discourse on] the Elucidation of Delusions.
DL: [...]The cloud of Muzi's principle of delusion.

[Article 1]

或問曰。佛從何出生。寧有先祖及國邑不。皆何施行。狀何類乎。
P: On demande parfois: "Comment donc est né le Buddha? A-t-il des ancêtres et une ville natale? Qu'est-ce qu'il a fait? Quelle espèce d'homme était-ce?"
K: A critic asked: Where was the Buddha born? Did he have ancestors and a home place or not? Just what did he do? What did he look like?
Z: A person asked, “From whom was the Buddha born? Does he or does he not have ancestors and a country [to which he belonged]? What does he bestow on all people? What does he look like?”
DL: Some people ask: "Where was the Buddha born? How could there be ancestors and countries? How did they all practice? What kind of shape does it take?"

牟子曰。富哉問也。請以不敏。略說其要。
P: Meou-tseu dit: "Immense en vérité est cette question; je n'ai pas l'esprit vif, et je demande à ne répondre que l'essentiel.
K: Mou-tzu said: Rich indeed is the significance of your question! Even though I am not very bright, let me mention the main points.
Z: Mouzi replied, “You have big questions. I am dull, but I will roughly explain the outline [of the biography of the Buddha].
DL: Mouzi said: "How could you ask such a question? Please take it as a sign of insight. He'll briefly explain the essentials.

蓋聞佛化之為狀也。積累道德。數千億載不可紀記。
P: Or j'ai entendu dire que depuis que le Buddha passait de forme en forme, amassant le dao et le de il s'était écoulé des milliers de centaines de milliers d'années, à ne pouvoir les compter.
K: I have heard about the appearance of the Buddha's transformation, that when he had amassed the power of the Way for many countless aeons to a fullness unrecordable,
Z: I have heard the state of the Buddha's transformations. It is impossible to record his [entire] chronicle, since he accumulated moral conduct for several hundreds of billions of years.
DL: It's called the transformation of a buddha by hearing about its appearance. It's an accumulation of virtue. It's indescribable for hundreds of millions of years.

然臨得佛時。生於天竺。假形於白淨王夫人。
P: Quand il fut près d'obtenir l'état de Buddha, il naquit dans l'Inde. Il emprunta sa forme dans la femme du roi Suddhodana.
K: when he was on the verge of realizing awakening, he was born in India. He borrowed his form from the wife of Śuddhodana.
Z: Nevertheless, when he came into [the existence in which he would become] the Buddha, he was born in India, in the provisional form of the son of [Lady Māyā], the queen of King Śuddhodana.
DL: But when he was about to attain Buddhahood, he was born in India. He assumed the form of Madame Prabhūtaratna.

晝寢夢乘白象身有六牙。欣然悅之。
P: Comme celle-ci faisait la sieste, elle rêva [de quelqu'un] monté sur un éléphant blanc dont le corps avait six défenses. Toute contente,
K: During a nap she dreamed that she was riding on a white elephant with six tusks. Delighted,
Z: Once, while taking a midday nap, she had a dream that she rode upon a white, six-tusked elephant. She was very delighted
DL: She dreamt in her daytime sleep that she was riding in a white elephant with six tusks on its body and was delighted with it.

遂感而孕。
P: elle s'en félicita; ensuite, elle fut émue et devint grosse.
K: she was filled with joy and, affected [by it], conceived.
Z: and accordingly felt that she had become with child.
DL: She was then impressed and conceived.

以四月八日。從母右脇而生。
P: Le 8e jour du 4e mois, [le Buddha] naquit du flanc droit de sa mère.
K: On the eighth day of the fourth month, he was born from his mother's right side.
Z: On the eighth day of the fourth month, [the child] was born from the right armpit of his mother.
DL: It was on the eighth day of the fourth month. He was born from his mother's right hypochondrium.

墮地行七步。舉右手曰。
P: En arrivant à terre, il fit sept pas, leva la main droite et dit:
K: When he trod the earth, he strode seven paces, lifted up his right hand, and declared:
Z: He landed upon the ground, took seven steps forward, and said, raising his right arm,
DL: She fell to the ground and walked seven steps. He raised his right hand and said.

天上天下靡有踰我者也。
P: "Dans le ciel, sous le ciel, il n'est personne qui me dépasse."
K: "Above or below the heavens, there is no one who excels me!"
Z: ‘There is no one in Heaven and Earth who goes beyond me.’
DL: 'There's no one in heaven or on earth who surpasses me.'

時天地大動宮中皆明。
P: A ce moment, le ciel et la terre furent grandement ébranlés, et le palais fut tout illuminé.
K: At that instant the heavens and the earth shook violently and the palace was filled with light.
Z: At that time, Heaven and Earth quaked tremendously and the palace became bright all over.
DL: There was a great upheaval in heaven and earth, and the palace was awash with light.

其日王家青衣復產一兒。厩中白馬亦乳白駒。
P: Ce jour-là, une servante de la famille royale mit, elle aussi, un fils au monde et, dans l'écurie, une jument blanche mit bas un poulain blanc.
K: On that same day a blue-robed [servant girl] of the king's household gave birth to a son and in the stable a white horse dropped a white colt.
Z: On the same day, a servant of the royal family also gave birth to a child, and a white horse in the stable bore a white colt.
DL: On that day, the king's blue robe again gave birth to a son. A white horse in the stable also had a white colt.

奴字車匿。馬曰揵陟。
P: Le [jeune] domestique fut appelé Tch'ö-ni (Chandaka) et le cheval fut nommé Kien-tche (Kanthaka).
K: That servant was named Chandaka and the horse Kaṇṭhaka.
Z: The [servant’s] baby was called Chandaka and the white colt was named Kaṇṭhaka.
DL: He called his son Chandaka. The horse was called Gandhāra.

王常使隨太子。
P: Le prince les mit tous deux au service constant du prince héritier.
K: The king ordered them always to attend the prince.
Z: The king instructed them to always attend the prince.
DL: The king always sent him to follow the prince.

太子有三十二相八十種好。
P: Le prince héritier avait 32 laksana et 80 anuvyañjana.
K: The prince had thirty-two major marks and eighty minor marks.
Z: “The prince had thirty-two major marks and eighty minor marks of physical excellence.
DL: The prince had thirty-two features and eighty beauties.

...

This may suffice for a first impression. Note that the DeepL translation given here represents the present state of the art. I understand that by adding more training sessions, which are planned, its quality surely will improve.

In conclusion, I think that AI machine translations can be very useful as long as their limitations are taken into account and as long as they are regarded as a sort of "thesis" against which one can produce one's own translation as an "anti-thesis".

Kind regards
Stephan Peter Bumbacher

Dear Dr. Bingenheimer,

I’m glad to know that you share my concerns regarding potential impacts on education. But I disagree with your contention that this is a fait accompli and we simply need to figure out how to adapt (which was also echoed by Dylan Toda). If humans are controlling the technology—and so far it seems that we are still in control of the machines—we can also stop using it should we so wish. What we used to call the “military-industrial complex” may have decided that AI is inevitable, but it’s not inevitable in our field unless we let it be. Moreover, the Linguae Dharmae project that you are working on is actually hastening its arrival in Buddhist Studies. So what you’re effectively saying is not that it’s inevitable but that you want it to be so. And so I’d like to ask you, why? What’s in it for us? Sell it to the skeptics reading this list.

If the answer is truly that no one wants this but it’s unavoidable, well, then as humanists we need to step up and reassert human agency. I benefit on a daily basis from the digital tools already mentioned in Bruno Galasek-Hul’s illuminating contribution to this thread. Perhaps machine translation is just one more of these tools. Perhaps it is something different. (I happen to think there is a big difference between the translation “memory-banks” mentioned in several responses and the actual deep learning/AI translation machines that your team has announced. I also think that the fact that this new digital tool is to be applied to a broad swath of Buddhist scriptures carries weighty ethical implications that have not been addressed.) Given its potential ramifications, the matter deserves open and vigorous debate—and that debate, crucially, should include a wide spectrum of Buddhist voices as well as academics. To that end, I’m grateful for your initial announcement and the responses it has elicited.

I also have a technical question for the Linguae Dharmae team. Why was the CBETA data supplied to DeepL by SuttaCentral? Maybe I have misunderstood your description of events, but why didn’t the data come directly from CBETA?

Yours,
Meghan

Dear Meghan,

>> I also have a technical question for the Linguae Dharmae team. Why was the CBETA data supplied to DeepL by SuttaCentral? Maybe I
>> have misunderstood your description of events, but why didn’t the data come directly from CBETA?

Factual questions are easier to answer, so let me start with this: As the announcement says, "DeepL ... has used training data from the Chinese Buddhist canon (Chi/Eng sentence pairs) provided by SuttaCentral". Training data is in this case used to fine-tune the pre-trained statistical language model of DeepL for Buddhist Chinese. SuttaCentral has long been actively aligning Buddhist texts with their translations. Much of their corpus is on GitHub. CBETA provides the largest and best-curated corpus of Buddhist Chinese, which - pace Prof. Bumbacher - not only fully includes the Taishō apparatus but has expanded and improved on it. (Ditto for the punctuation.) There are different versions of the CBETA corpus available; these days I prefer: https://github.com/DILA-edu/CBETA_TAFxml. (This repo is not complete, but it has the largest sections and I like its markup best.)
Again, CBETA is only the source of the Chinese; the actual training data consists of Ch-En sentence pairs, and SuttaCentral had made a great start on alignment, so DeepL was able to use that.
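
To make the notion of "Ch-En sentence pairs" concrete, here is a minimal Python sketch of how such aligned pairs could be stored, one JSON object per line. The `SentencePair` class, its field names, and the identifier string are assumptions made purely for illustration; this is not the format actually used by SuttaCentral or DeepL. The two example pairs reuse the CBETA text and Keenan's renderings quoted earlier in this thread.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SentencePair:
    """One aligned Chinese-English sentence pair (hypothetical schema)."""
    source_id: str  # hypothetical label for the source text
    zh: str         # Chinese sentence
    en: str         # aligned English translation


# Example pairs taken from the alignment quoted in this thread (Keenan's English).
pairs = [
    SentencePair("T52n2102", "天上天下靡有踰我者也。",
                 "Above or below the heavens, there is no one who excels me!"),
    SentencePair("T52n2102", "王常使隨太子。",
                 "The king ordered them always to attend the prince."),
]


def to_jsonl(items):
    """Serialize pairs as JSON Lines, keeping the Chinese characters readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in items)


def from_jsonl(text):
    """Parse JSON Lines back into SentencePair objects, skipping blank lines."""
    return [SentencePair(**json.loads(line))
            for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(to_jsonl(pairs))
```

Keeping one pair per line makes it easy to append newly aligned sentences, which matters if training data is to be accumulated gradually from many contributors.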

Regarding your concerns involving the "military-industrial complex" and the inevitability of AI, I am afraid I don't have a good answer. I also don't know what is "in it for us" and am a terrible salesman. Speaking strictly for myself, I decided 25 years ago that Buddhism is full of good ideas and that having Buddhist texts around in different encodings and languages is a wholesome thing. A legacy decision of sorts, but these are the things we live by. A list of translations shows that over the last 150 years we have translated some 10-20% of the Chinese Buddhist corpus into "Western" languages (http://mbingenheimer.net/tools/bibls/transbibl.html). I was never much bothered by that pace; good things take time. But still, the idea that suddenly, within my projected lifetime, we might have "good enough" translations of the whole corpus is fascinating to me. "Good enough" is actually a term used in machine translation. In our case it denotes a translation that is easier to debug than to do from scratch. My sense is that our current models get c. 5-15% of sentences right. If we can get that to c. 60%, we can start reading these translations and get a good sense of what the text is about. At around 80% accuracy, translators can "relatively" quickly fix the mistakes, or in any case much faster than starting from zero.
Another aspect of this endeavor is that I always wanted to see more translations of Buddhist texts into non-European languages (Arabic, Hindi, Swahili, Yoruba etc.). With better MT this now has entered the realm of the possible.
Getting there will take time and effort. My current guess of 10-20 years might be wrong. Prof. Keutzer's caution in this regard is appreciated. We might encounter some difficult problem and plateau for a few decades, but given that we have so far used very little training data and small models, and given the rapid overall development of neural MT, there is room for optimism.
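
The "good enough" thresholds mentioned above (c. 60% of sentences right for gist reading, around 80% for efficient post-editing) can be sketched as a small Python calculation. This is only an illustration under stated assumptions: the function names and stage labels are invented here, "accuracy" is taken to mean the share of sentences a human reviewer judges correct, and the sample judgements are made-up data, not an evaluation of any actual model.

```python
def sentence_accuracy(judgements):
    """Return the share of sentences a reviewer marked as correct (True)."""
    if not judgements:
        raise ValueError("need at least one judged sentence")
    return sum(judgements) / len(judgements)


def usability_stage(accuracy):
    """Map an accuracy between 0 and 1 to the stages described in the letter."""
    if accuracy >= 0.80:
        return "post-editable"      # fixing mistakes is faster than starting from zero
    if accuracy >= 0.60:
        return "readable for gist"  # good sense of what the text is about
    return "not yet usable"


if __name__ == "__main__":
    # Invented judgements: 2 of 20 sentences correct, within the c. 5-15% range.
    sample = [True] * 2 + [False] * 18
    acc = sentence_accuracy(sample)
    print(acc, usability_stage(acc))
```

A per-sentence right/wrong judgement is a deliberately coarse measure; it ignores how badly a wrong sentence is wrong, which is exactly what decides how quickly a translator can fix it.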

All the best

Marcus

Dear Colleagues

I do not wish to enter the discussions about machine translation in a wider perspective, other than to offer a small note about what is essentially a technical issue, though it could have somewhat wider applicability as well.

In his interesting compilation, Stephan Peter Bumbacher offered the following example:

[...]牟子理惑云。
P: Meou-tseu ou les doutes levés.
K: How Master Mou Removes our Doubts.
Z: [Discourse on] the Elucidation of Delusions.
DL: [...]The cloud of Muzi's principle of delusion.

The interesting point: 云 (yún) to the best of my knowledge is never used in pre-modern Chinese as a replacement for 雲 (yún), and as far as I can remember I have never seen it even in manuscripts which otherwise do employ what we might think of as "abbreviations." However, evidently the lexicons used by the team working on the DL project do not carefully enough distinguish such historical usages. This is something which I think can be addressed by a proper compilation of accurate resources. Paul Kroll's invaluable "A Student's Dictionary of Classical and Medieval Chinese," for instance, is a historically sensitive lexicon, but it is not available in Open Access simply to be ingested; that it is comparatively weaker on Buddhist vocabulary is not important since this can be covered from other sources. But we must also note that even sources like the DDB usually make no attempt to distinguish uses, for instance meanings of words (and it is important to remember that we are talking about words, and not "characters") that differ, for instance, between An Shigao and Zhiyi. All of these issues are, I am quite sure, not at all far from the mind of Marcus and his team, but it seemed perhaps not entirely useless to mention it.
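
As a rough illustration of what a "historically sensitive" resource might look like in machine-readable form, the following Python sketch keys each gloss list by both the written form and a period label. Everything here is an assumption for the sake of the example: the `LEXICON` table, the period labels, and the `glosses` function are invented, the glosses are deliberately minimal, and nothing in this thread tells us whether or how the DeepL or Linguae Dharmae models consult a lexicon at all.

```python
# Hypothetical period-aware lexicon: (written form, period) -> English glosses.
LEXICON = {
    ("云", "premodern"): ["to say", "it is said"],
    ("云", "modern-simplified"): ["cloud", "to say"],  # 云 also serves as the simplified form of 雲
    ("雲", "premodern"): ["cloud"],
    ("雲", "modern-traditional"): ["cloud"],
}


def glosses(word, period):
    """Return the glosses recorded for a word in a given period (empty list if none)."""
    return LEXICON.get((word, period), [])


if __name__ == "__main__":
    print(glosses("云", "premodern"))
    print(glosses("云", "modern-simplified"))
```

With a table of this shape, "cloud" is simply not offered as a candidate for 云 in a pre-modern text. The same keying could in principle be extended from periods to individual translators or authors, which is the finer distinction (An Shigao versus Zhiyi) raised above.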

Best,
Jonathan Silk