Can anyone recommend a tool for corpus analysis of Chinese that (1) works on the basis of words rather than single characters (so it can find, e.g., the frequency of 社会, not just 社), and (2) lets me analyze documents I select myself? (Free software would be best ;).
Language-neutral software such as AntConc has difficulty recognizing Chinese words, since there are no spaces between them. Tools like the Lancaster corpus can identify words, but they analyze a corpus from the internet, so I cannot, for example, analyze the language in a sub-corpus I have collected myself.
I want to use the software for discourse analysis, so I need to select the sources myself (not take random text samples from the internet). I have used AntConc after inserting spaces between words, but this is really tiring, and I am looking for better solutions.
Thanks a lot for the help,
Hi there, a widespread solution for Chinese texts is MARKUS (https://dh.chinese-empires.eu/markus/beta/). Maybe it will suit your needs too.
You might post this query on the Digital Sinology page on Facebook. There are many experts there.
I personally use AntConc, a donation-based (so, free-if-you-want-it) concordance tool. You can use it with your own text files and you can select single- or multiple-character strings.
The developer's site can be found at: https://www.laurenceanthony.net/software/antconc/
Best of luck,
A. Charles Muller
Have you tried the parsing tools available at SmartHanzi.net (https://www.smarthanzi.net/) including DDB Access (https://www.smarthanzi.net/ddbaccess/index.php)?
I would recommend DocuSky. You can build your own database collection of texts, attach your own metadata, search for thousands of word collocations (thus 社, 社會, 社會群), and get outputs that you can process in Palladio and other useful tools. It has a host of tools. I agree with Elena Valussi above: Digital Sinology is good, as well as 數位人文研究 (Digital Humanities).
You can see some examples of my DocuSky database here:
You can also try Voyant, which is easy to set up and has a powerful set of tools.
SmartHanzi and DDB Access, kindly mentioned by Charles Muller, are basically interactive tools. Following this discussion and in liaison with Marius Meinhof, a feature was added to the applications (Windows versions) to produce a file with spaces between words. It was confirmed that this segmented text file can be used with AntConc.
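For readers who want to check a segmented file outside a concordancer, here is a minimal Python sketch of counting word frequencies in text that already has spaces between words. The function name `word_frequencies`, the punctuation list, and the sample sentence are made up for illustration; this is not part of SmartHanzi, DDB Access, or AntConc.

```python
from collections import Counter

# Hypothetical, non-exhaustive punctuation list used to skip non-word tokens.
PUNCTUATION = set("，。、；：？！「」『』（）(),.;:?!")

def word_frequencies(segmented_text):
    """Count words in text whose words are already separated by whitespace."""
    # split() with no argument splits on any run of whitespace.
    tokens = segmented_text.split()
    # Keep only tokens that are not made up entirely of punctuation.
    words = [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]
    return Counter(words)

# Invented sample, already segmented with spaces.
sample = "社会 学 研究 社会 问题 ， 社会 变迁 。"
freq = word_frequencies(sample)
print(freq["社会"])  # 3
print(freq["社"])    # 0, because counting is by word, not by character
```

The point of the example is that once the spaces are in place, 社会 is counted as one word and 社 on its own is not, which is what the original question asked for.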
The next step was to make a more effective dedicated tool. "Chinese Bulk Parser" can process multiple files with one mouse click. One just has to drag and drop files into the application window.
Common formats (.pdf, .rtf, .doc/docx, etc.) with text content are recognized. OCR is not supported.
The application relies on CEDICT, CFDict, HanDeDict for contemporary Chinese, and DDB (Digital Dictionary of Buddhism) and CJKV-E (Dictionary of Confucian, Daoist and Intellectual Historical Terms) for classical Chinese.
It can be used for contemporary texts as well as classical Chinese and texts from the Buddhist canon.
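To illustrate the general idea of dictionary-based segmentation, here is a minimal Python sketch of greedy forward maximum matching. This is an assumption about how such tools can work in principle, not a description of the actual algorithm in Chinese Bulk Parser. The function `segment`, the `max_len` parameter, and the tiny `lexicon` are hypothetical; a real word list would come from a dictionary such as CEDICT.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical toy lexicon, for illustration only.
lexicon = {"社会", "社会学", "研究", "问题"}
print(" ".join(segment("社会学研究社会问题", lexicon)))
# 社会学 研究 社会 问题
```

The output depends entirely on the word list: with a different lexicon the same string is split differently, so a dictionary suited to the period and genre of the texts matters.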
"Chinese Bulk Parser" (free application) can be downloaded at www.smarthanzi.net/tools
A new version of Chinese Bulk Parser (v2023.04) is available at www.smarthanzi.net/tools
CBP can segment Chinese texts for input in text analysis programs. The new version has selectable dictionaries: for Classical Chinese, this avoids recognizing modern words from standard dictionaries.