Dear Colleagues,
Can anyone recommend a tool for corpus analysis of Chinese texts that (1) works on the basis of words (not single characters, so that it can find, e.g., the frequency of 社会, not just 社) and (2) lets me analyze documents I select myself? Free software would be best ;)
Language-neutral software such as AntConc has difficulty recognizing Chinese words, since there are no spaces between them. Tools like the Lancaster corpus can identify words, but they analyze a corpus from the internet, so I cannot, for example, analyze the language in a sub-corpus I have collected myself.
I want to use the software for discourse analysis, so I need to select the sources myself (not take random text samples from the internet). I have used AntConc and inserted spaces between words myself, but this is really tiring, and I am looking for a better solution.
Thanks a lot for the help,
Marius
7 Replies
Martin Hanker
Hi there, a widespread solution for Chinese texts is MARKUS (https://dh.chinese-empires.eu/markus/beta/). Maybe it will suit your needs too.
Elena Valussi
You might post this query on the Digital Sinology page on Facebook. There are many experts there.
Maura Dykstra
I personally use AntConc, a donation-based (so, free-if-you-want-it) concordance tool. You can use it with your own text files and you can select single- or multiple-character strings.
The developer's site can be found at: https://www.laurenceanthony.net/software/antconc/
Best of luck,
Maura
A. Charles Muller
Have you tried the parsing tools available at SmartHanzi.net (https://www.smarthanzi.net/), including DDB Access (https://www.smarthanzi.net/ddbaccess/index.php)?
Regards,
Chuck
Michael Stanley-Baker
I would recommend DocuSky. You can build your own database collection of texts, attach your own metadata, search for thousands of word collocations (such as 社, 社會, 社會群), and get outputs that you can process in Palladio and other useful search software. It has a host of tools. I agree with Elena Valussi above that Digital Sinology is good
https://www.facebook.com/groups/digitalsinologygroup/
as well as 數位人文研究(Digital Humanities)
https://www.facebook.com/groups/1229233120487449/
You can see some examples of my DocuSky database here:
https://michaelstanley-baker.com/digital-humanities/daobudmed6d/
You can also try Voyant, which is easy to set up and has a powerful set of tools.
https://voyant-tools.org/
Best wishes,
Michael
Jean Soulat
SmartHanzi and DDB Access, kindly mentioned by Charles Muller, are basically interactive tools. Following this discussion and in liaison with Marius Meinhof, a feature was added to the Windows versions of the applications to produce a file with spaces between words. It was confirmed that this segmented text file can be used with AntConc.
The next step was to make a more effective, dedicated tool. "Chinese Bulk Parser" can process multiple files with one mouse click. One just has to drag and drop files into the application window.
Common formats (.pdf, .rtf, .doc/docx, etc.) with text content are recognized. OCR is not supported.
The application relies on CEDICT, CFDict, HanDeDict for contemporary Chinese, and DDB (Digital Dictionary of Buddhism) and CJKV-E (Dictionary of Confucian, Daoist and Intellectual Historical Terms) for classical Chinese.
It can be used for contemporary texts as well as classical Chinese and texts from the Buddhist canon.
"Chinese Bulk Parser" (free application) can be downloaded at www.smarthanzi.net/tools
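For readers curious how dictionary-based segmentation can work in principle, here is a minimal Python sketch of greedy forward maximum matching. It is not the algorithm used by Chinese Bulk Parser, which is not described in this thread. The function name `segment`, the `max_len` parameter, and the four-entry `TOY_LEXICON` are assumptions made up for illustration; a real lexicon would be loaded from a dictionary such as CEDICT.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    that appears in the lexicon; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"中国", "社会", "发展", "社会主义"}

# Join the words with spaces so a space-based tool can read the result.
segmented = " ".join(segment("中国社会发展", TOY_LEXICON))
print(segmented)  # 中国 社会 发展
```

Greedy matching is simple but can pick the wrong split when entries overlap, so the quality of the output depends heavily on which dictionary is used.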
Jean Soulat
A new version of Chinese Bulk Parser (v2023.04) is available at www.smarthanzi.net/tools
CBP can segment Chinese texts for input into text analysis programs. The new version has selectable dictionaries: for classical Chinese, this avoids recognizing modern words from standard dictionaries.
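Once a text has been segmented with spaces, word-level counts of the kind asked about in the original question become straightforward. The following is a minimal Python sketch using only the standard library; the function name `word_frequencies` and the sample sentence are invented for illustration and are not taken from any of the tools mentioned in this thread.

```python
from collections import Counter


def word_frequencies(segmented_text):
    """Count words in a text whose words are already separated by whitespace."""
    return Counter(segmented_text.split())


# Hypothetical sample, already segmented into words.
sample = "中国 社会 发展 和 社会 问题"
freq = word_frequencies(sample)

print(freq["社会"])  # 2
print(freq["社"])    # 0, because 社 never occurs as a word on its own here
```

Because the count is taken over whole words, 社会 is not confused with the single character 社, which is the distinction the original question asks for.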