Big data is critical for shaping 'digital humanities'

Published: 23:52, March 27, 2024 | Updated: 14:17, April 8, 2024

By Wong Kam-fai

With the rapid development of the digital economy, digital humanities has become a popular research field in many institutions worldwide. Over the past year, generative artificial intelligence (GenAI) has emerged, bringing many conveniences to the human race. Officials, industries, academia and research sectors around the globe are all vying to use it. This trend is unstoppable and will continue to be the driving force in the innovation and technology industry in 2024. The goal of AI research is to let machines replace humans, so the subtle relationship between AI (digital) and human intelligence (humanities), and how the two interact and cooperate, are key topics of concern for many digital humanities scholars, including this author.

Theoretically, digital (D) humanities (H) can be divided into three categories. The first is D2H: How can data be used to analyze and understand the culture of the real world? This is a typical application of big data. The second is H2D: How can real human culture be imitated and transferred to the virtual world to achieve the effect of “digital twins”? The third is D&H: How can interaction between the real and virtual worlds be promoted, and an efficient “cyber-physical system” be built to network the physical world?

Simply put, from an academic perspective, digital humanities covers four major disciplines: linguistics, history, philosophy and the arts. Computer scientists have long worked to digitize these subjects, expand and deepen their content, promote interdisciplinary studies, and help optimize teaching and learning outcomes. If digitization is applied inappropriately, however, it will inevitably distort the substance of these subjects. Whatever the discipline, digital humanities is closely tied to data. This article aims to highlight the impact of the “digital” on the “humanities”.

For AI in linguistics, natural language processing (NLP) technology is used for language analysis and understanding. NLP capabilities are based on deep learning and require the support of large corpora (that is, text big data) for system training.

Corpus training can easily lead to the effect of “language discrimination”, which in turn raises the issue of “language conservation”.

Large corpora are mainly based on commonly used online languages. For this reason, ChatGPT can fluently converse with users in English, Chinese, Spanish, and Arabic (the languages most used on the internet currently), but it is helpless with languages that have not been digitized. For example, the least-used language in the world is Ayapaneco, an ancient language used by a tiny number of people in Mexico. There is no digital form of the language online. Some experts estimate that the least-used, low-resource languages will vanish from the internet, leading to the disappearance of their related cultures. What is even more frightening is that if this unhealthy situation continues, the culture of the future online world will be manipulated by the most powerful nations.

For AI in philosophy, take ChatGPT as an example. ChatGPT is built largely on deep learning; like a parrot mimicking speech, it learns conversational skills from a large corpus. The quality of the corpus is therefore critical. The most common flaw is the hallucination effect: ChatGPT makes things up or answers off topic when its training data is insufficient. Moreover, hallucination can have a chain effect: one wrong answer influences the user’s next prompt and the reasoning and answers that follow, resulting in a series of mistakes.

History is based on the archives of past events. Deep-learning technology can certainly help historical research cover knowledge more deeply and broadly, but this advantage depends on the authenticity of the training data. Deep learning, however, is mainly a set of statistical calculations. It does not care whether the data is authentic, because it does not perform “fact checking”. Furthermore, whether a historical event in its output is true or false, the system cannot explain the result. The digitization of history also has a domino effect: if unverified accounts of historical events spread, the credibility of future digital history will be greatly diminished.
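The point that a statistical model repeats whatever its corpus says most often, true or not, can be illustrated with a toy sketch. This is not how ChatGPT or any deep-learning system is actually built; it is a simple word-pair (bigram) counter, and the corpus, the function names `train_bigrams` and `complete`, and the dates are all hypothetical, invented purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def complete(follows, prompt, max_words=5):
    """Extend the prompt by repeatedly picking the most frequent next word."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # nothing ever followed this word in the corpus
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpus: suppose 1900 is the accurate date, but the
# inaccurate date (1950) appears more often in the training text.
corpus = [
    "the old library opened in 1900",
    "the old library opened in 1950",
    "the old library opened in 1950",
]

model = train_bigrams(corpus)
completion = complete(model, "the old library opened in")
print(completion)
```

The model has no notion of which date is correct. It simply returns the continuation it has seen most often, which is why the authenticity of the training data matters so much.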

In the realm of art, consider the recent copyright infringement lawsuit filed by The New York Times in the United States. The defendants, OpenAI and Microsoft, are alleged to have used The New York Times’ articles to train ChatGPT without permission. According to the complaint, some text generated by ChatGPT reproduced the original articles almost verbatim. Similar concerns frequently arise with the AI image generator Midjourney, which has also drawn suspicions of copyright infringement. This prompts the question: Will the future of AI require a redefinition of the creative ecosystem and of copyright for automatically generated art? And how can the value produced in this process be distributed equitably?

It has been emphasized that “Safety underpins development, while development ensures safety. Both safety and development must be advanced concurrently.” Consequently, as Hong Kong fosters the digital economy, it must also give due consideration to “digital security”. In the ongoing fourth (AI) industrial revolution, data is the pivotal resource for innovation and production, and its integrity must be safeguarded against intrusion and contamination. Compromised data not only impedes the economic progress of the Hong Kong Special Administrative Region, but also creates a loophole that criminals could exploit and that could jeopardize China’s national security. Hence, as we move further into 2024, digital security, encompassing network security, AI security, and the like, is of paramount importance to economies worldwide. The HKSAR government must not overlook it.

The author is a member of the Legislative Council, associate dean (external affairs) of the Faculty of Engineering of the Chinese University of Hong Kong, and vice-president of the Hong Kong Professionals and Senior Executives Association.

The views do not necessarily reflect those of China Daily.
