Analyzing Classical Japanese Waka through Embedding Vectors

Yasuhiro Kondo (Aoyama Gakuin University)
yhkondo@cl.aoyama.ac.jp

This note is an English translation of my original note, found at this link

Analysis of the Style of Japanese Waka Poetry Collections

Each classical Japanese waka poetry collection possesses its own character. For example, the "Manyoshu" celebrates nature and contains 'simple' poems, while the "Kokinshu" reflects the 'elegant' traditions of the imperial court. Though there are various ways to describe these characteristics, it's undeniable that each collection has its distinct style of poetry. This entry discusses the analysis of these styles using computers, specifically AI. It is an explanatory article based on my paper "Describing the Linguistic Variations in Waka Collections - An Analysis Using Large Language Models," published in volume 19, issue 3 of "Studies in the Japanese Language" (December 2023). The full paper will be available on J-STAGE in June 2024.

What are Embedding Vectors?

In this entry, we convert each waka poem into numerical values called embedding vectors using large language models (LLMs) like ChatGPT. By comparing these numerical values, we examine the overall characteristics of the poems.

Let's start with a simple explanation of 'embedding vectors.' Consider a basic example of vectorizing a co-occurrence distribution.

       高い  長い  登る  流れる  頂上  橋
山  [   1     0     1      0      1    0 ]
川  [   0     1     0      1      0    1 ]

Suppose we represent whether the words 'mountain' (山) and 'river' (川) co-occur with other words as 1 or 0. In this case, 'mountain' can be represented as the six-dimensional vector [1 0 1 0 1 0], and 'river' as [0 1 0 1 0 1]. Using actual frequencies of occurrence instead of 0/1 would further increase the accuracy. These vectors merely record the distribution of surrounding words, yet words with similar vectors clearly have similar meanings. This becomes evident if you construct the vector for the word 'hill' in the same way. (This idea originates in Z. Harris's 'distributional hypothesis'; Harris was Chomsky's mentor.)
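To make this concrete, here is a minimal sketch in Python that treats the rows of the table above as vectors and compares them with cosine similarity; the row for 'hill' (丘) is an assumed toy distribution added purely for illustration.

import numpy as np

# Toy co-occurrence vectors from the table above.
# Columns: 高い, 長い, 登る, 流れる, 頂上, 橋
yama = np.array([1, 0, 1, 0, 1, 0])  # 山 (mountain)
kawa = np.array([0, 1, 0, 1, 0, 1])  # 川 (river)
oka  = np.array([1, 0, 1, 0, 0, 0])  # 丘 (hill) -- assumed values, for illustration only

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for no shared contexts."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(yama, oka))   # high: 'mountain' and 'hill' appear in similar contexts
print(cosine(yama, kawa))  # 0.0: 'mountain' and 'river' share none of these contexts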

Vector Creation with LLMs

Such vectors are created with deep learning tools such as Word2Vec, the revolutionary method developed by Mikolov, Sutskever, and their colleagues at Google in 2013. The approach has since evolved, and vectors can now be created with more recent deep learning architectures such as the Transformer and BERT. OpenAI currently offers an embedding-specific LLM called text-embedding-ada-002, based on GPT-3 (the foundation of ChatGPT) and publicly available on OpenAI's cloud. By accessing it via the API, one can easily obtain a 1536-dimensional vector for a word or sentence. This model has been trained on a vast multilingual corpus held by OpenAI, and when you feed it words or sentences it can discern similarities across languages, an incredibly powerful feature. It also performs quite well on classical Chinese and Japanese, so we will use it for our analysis (similar analyses can be done with locally run embedding LLMs, but that's a topic for another time).
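As a rough illustration of how these vectors are obtained, here is a minimal sketch, assuming the official openai Python package (v1 or later) and an API key set in the OPENAI_API_KEY environment variable; the example inputs are texts quoted later in this note.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

texts = [
    "池凍東頭風度解窓梅北面雪封寒",                      # Chinese poem used as a query below
    "梅の花それとも見えず久方の天霧る雪のなべて降れれば",  # a waka from the "Kokinshu"
]

# Each input string comes back as a 1536-dimensional embedding vector.
response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vectors = [item.embedding for item in response.data]
print(len(vectors[0]))  # -> 1536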

Vector Searching Waka with Chinese Poetry

Using this capability, you can, for instance, search waka directly with Chinese poetry as the query. This method, known as vector search, is commonly used in business to search manuals and FAQs, but it is also useful in literature and linguistics. Here is an example of searching the "Kokinshu" for waka that closely resemble the meaning of a Chinese poem by Fujiwara no Atsumoto from the "Wakan Roeishu." (Internally, this is done by vectorizing each text and then measuring the similarity between the vectors with cosine similarity.) It demonstrates how competent the system is at understanding the meaning of the classics, including Chinese literature. Next, we will look at the "Kokinshu" as a whole.

query: 池凍東頭風度解窓梅北面雪封寒
rank: waka text (cosine similarity)
1: 梅の花それとも見えず久方の天霧る雪のなべて降れれば (0.83671372975217)
2: 逢坂の嵐の風は寒けれどゆくへ知らねばわびつつぞ寝る (0.8344075388592953)
3: 梅の香の降りおける雪にまがひせば誰かことごとわきて折らまし (0.8313966248472829)
4: 浦ちかく降りくる雪は白波の末の松山越すかとぞ見る (0.8297397632079765)
5: 細枝結ふ葛城山に降る雪の間なく時なく思ほゆるかな (0.8293573454140316)
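A minimal sketch of the ranking step just described (compare the query vector with the vector of every waka by cosine similarity and keep the best matches) might look like the following; the variable names are hypothetical and not taken from the published analysis.

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def vector_search(query_vector, poem_vectors, poem_texts, top_k=5):
    """Return the top_k poems most similar to the query, highest similarity first."""
    scored = [(cosine_similarity(query_vector, v), t)
              for v, t in zip(poem_vectors, poem_texts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical usage, with embeddings obtained as in the earlier sketch:
# for score, text in vector_search(query_vector, kokinshu_vectors, kokinshu_texts):
#     print(f"{text} ({score})")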

Principal Component Analysis of Vectors from "Kokinshu"

To understand the semantic structure of the entire "Kokinshu," we first vectorize it. However, 1,536 dimensions are far too many for a human to grasp directly. Such continuous, multidimensional data can be reduced, and its essential characteristics extracted, through principal component analysis. Here, we compressed the vector of each poem in the "Kokinshu" into two dimensions and mapped them onto the X-axis (first principal component) and Y-axis (second principal component). Each dot represents the vector of one poem, and the poems with the maximum and minimum values on each axis are shown. The first dimension captures the most variation, followed by the second.
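As a minimal sketch of this reduction step (assuming scikit-learn; the random matrix below merely stands in for the actual 1,536-dimensional ada-002 vectors of the roughly 1,100 "Kokinshu" poems):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: in the real analysis this would be the (n_poems, 1536)
# matrix of text-embedding-ada-002 vectors for the "Kokinshu."
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1100, 1536))

pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings)  # shape: (n_poems, 2)

x = coords[:, 0]  # first principal component  -> X-axis of the scatter plot
y = coords[:, 1]  # second principal component -> Y-axis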


   Image: Two-dimensional scatter plot of "Kokinshu"

Carefully examining each dot (especially the poems at the extreme top and bottom of each axis) reveals the characteristics of each dimension. The extremes for each axis are listed below.
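These lists are, in essence, the result of sorting the poems along each principal component. A minimal sketch of that step, continuing from the PCA sketch above with hypothetical variable names:

import numpy as np

def extremes(axis_values, poem_texts, top_k=3):
    """Return the poems with the highest and lowest values along one PCA axis."""
    order = np.argsort(axis_values)
    lowest  = [poem_texts[i] for i in order[:top_k]]
    highest = [poem_texts[i] for i in order[::-1][:top_k]]
    return highest, lowest

# Hypothetical usage with 'coords' from the PCA sketch and 'texts' holding the waka:
# high_1, low_1 = extremes(coords[:, 0], texts)  # first dimension (X-axis)
# high_2, low_2 = extremes(coords[:, 1], texts)  # second dimension (Y-axis)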

(first dimension・high)
(Rank 1): 人を思ふ心は我にあらねばや身のまどふだに知られざるらむ
(Rank 2): 思ひけむ人をぞともに思はましまさしやむくいなかりけりやは
(Rank 3): 身を捨ててゆきやしにけむ思ふよりほかなるものは心なりけり
(first dimension・low)
(Rank 1): 秋ちかう野はなりにけり白露の置ける草葉も色かはりゆく
(Rank 2): 秋の月山辺さやかに照らせるは落つる紅葉のかずを見よとか
(Rank 3): 秋風の吹きと吹きぬる武蔵野はなべて草葉の色かはりけり

For instance, in the first dimension (X-axis), the top-ranking poems depict human emotions, the so-called 'human affairs' (人事), while the bottom-ranking poems depict nature, or 'scenic objects' (景物) (autumn poems are common here, but there are others).

(second dimension・high)
(Rank 1): まかねふく吉備の中山帯にせる細谷川の音のさやけさ
(Rank 2): 郭公声もきこえず山彦は外に鳴く音をこたへやはせぬ
(Rank 3): しほの山さしでの磯にすむ千鳥君が御代をば八千代とぞ鳴く
(second dimension・low)
(Rank 1): 春ごとに花のさかりはありなめどあひ見むことは命なりけり
(Rank 2): 色見えで移ろふものは世の中の人の心の花にぞありける
(Rank 3): 花見れば心さへにぞ移りける色にはいでじ人もこそ知れ

In contrast, in the second dimension (Y-axis), the top-ranking poems involve the sounds of birds or rivers, while the bottom-ranking poems are about flowers. In essence, this forms an opposition axis of 'birds' versus 'flowers.'

Summarizing this in a simple XY-axis diagram, we get:

             bird
               |
               |
scenic --------+-------- human
               |
               |
            flower

Surprisingly, LLM embedding vectors demonstrate that 'human affairs' versus 'scenic objects' and 'birds' versus 'flowers' form the primary axes of the semantic structure of the "Kokinshu." Evidently, the LLM can 'read' the "Kokinshu" correctly. Seeing these results for the first time was startlingly impressive. Encouraged by this, we also analyzed the "Manyoshu."

Principal Component Analysis of Vectors from "Manyoshu"

Like "Kokinshu," we vectorize "Manyoshu" and then examine the first and second principal components. We'll skip the scatter plot but show only the top and bottom-ranking songs.

In the first dimension, as in the "Kokinshu," there is a division between 'human affairs' and 'scenic objects' (though poems about 'birds' are prominent among the latter, there are others).

(first dimension・high)
(Rank 1): 我妹子に恋ふるに我はたまきはる短き命も惜しけくもなし
(Rank 2): 我妹子を相知らしめし人をこそ恋の増されば恨めしみ思へ
(Rank 3): 我妹子に恋ひすべながり夢に見むと我は思へど寝ねらえなくに
(first dimension・low)
(Rank 1): 磯の崎漕ぎ廻み行けば近江の海八十の湊に鶴さはに鳴く
(Rank 2): 舟競ふ堀江の川の水際に来居つつ鳴くは都鳥かも
(Rank 3): 滝の上の三船の山ゆ秋津辺に来鳴き渡るは誰呼子鳥

However, the second dimension is different. In the "Manyoshu," the top-ranking poems are predominantly about 'mountains,' while the bottom-ranking ones are about 'seas.'

(second dimension・high)
(Rank 1): 秋山の木の葉もいまだもみたねば今朝吹く風は霜も置きぬべく
(Rank 2): 冬ごもり春さり来れば鳴かざりし鳥も来鳴きぬ咲かざりし花も咲けれど山をしみ入りても取らず草深み取りても見ず秋山の木の葉を見ては黄葉をば取りてそしのふ青きをば置きてそ嘆くそこし恨めし秋山そ我は
(Rank 3): 雪寒み咲きには咲かず梅の花よしこのころはかくてもあるが
(second dimension・low)
(Rank 1): 磯ごとに海人の釣舟泊てにけり我が船泊てむ磯の知らなく
(Rank 2): 奈呉の海人の釣する舟は今こそば舟棚打ちてあへて漕ぎ出め
(Rank 3): 大崎の神の小浜は小さけど百船人も過ぐといはなくに

Summarizing this on the XY-axis, we get:


           mountain
               |
               |
scenic --------+-------- human
               |
               |
              sea

That is, in "Manyoshu," the second dimension represents a 'mountain-sea' opposition. This contrasts with the 'bird-flower' opposition in "Kokinshu." This aligns well with one conventional view of waka history, but the clarity of this result underscores the remarkable interpretive power of LLMs.

What Caused the Shift from "Manyoshu" to "Kokinshu"? - A View through the Vectors of Chinese Poetry

Having clearly demonstrated the stylistic shift from the "Manyoshu" to the "Kokinshu" through vector analysis, what could have caused it? Let's consider this, too, through vector analysis. The period between the "Manyoshu" and the "Kokinshu" was the so-called 'dark age of the national style,' when Chinese poetry and prose were highly revered. Hence, it makes sense to examine Chinese poetry. For this purpose, we use the Chinese poems from the "Wakan Roeishu," a collection beloved by people of the Heian period (including a rich selection from Bai Juyi's anthology), and conduct the same vector analysis. Here, we present only the result of vectorizing the Chinese poetry section of the "Wakan Roeishu" and applying principal component analysis.

            visual
               |
               |
scenic --------+-------- human
               |
               |
           auditory

Without knowledge of classical Chinese, it may be hard to grasp why the poems align this way. The first dimension, as in the Japanese collections, opposes 'human affairs' and 'scenic objects.' The second dimension contrasts things visible to the eye, such as 'grass,' 'trees,' and 'people,' with things audible to the ear, such as the songs of orioles, the cries of monkeys, and the songs of enemies heard on all sides, culminating in the 'visual' versus 'auditory' opposition shown above.

This is very fitting as an intermediary between the "Manyoshu" and the "Kokinshu." That is, if we look at the change along the Y-axis (second dimension):

 (Manyoshu)     (Chinese Poetry)     (Kokinshu)

  mountain   ⇒      visual       ⇒     flower
                       |
                       |
       scenic ---------+--------- human
                       |
                       |
     sea     ⇒     auditory      ⇒      bird

It's conceivable that Japanese poetry transitioned from the world of the "Manyoshu," where poems were composed amid natural settings such as 'mountains' and 'seas,' through the new aesthetic system of the 'visual' and the 'auditory' found in Chinese poetry, to the "Kokinshu," where 'flowers' and 'birds' emerged as new, representative materials of this refined aesthetic sense (the sign of the Y-axis is arbitrary in principal component analysis, so its reversal poses no statistical problem). Notably, in the Heian period, besides 'flowers and birds' (花鳥), the aesthetic pair 'wind and moon' (風月) also existed, with 'wind' being auditory and 'moon' visual, aligning them with this sequence. Perhaps the semantic roots of the phrase 'flowers, birds, wind, moon' (花鳥風月) lie in this context?

Studying Embedding Vectors Contributes to Understanding AI Thought Processes

Thus, using LLM (AI) embedding vectors, we can analyze classical waka, but the approach is not limited to waka. Similar analyses are possible for dictionaries and novels (I presented an analysis of dictionaries at the Vocabulary and Dictionary Research Society, and an analysis of novels is scheduled for publication in "Quantitative Linguistics," the journal of the Quantitative Linguistics Society. Various related books are also planned, as is content expanding on this entry, so stay tuned; I'll post updates on X (Twitter) and other platforms).

At the same time, as we inevitably engage with AI, it's crucial that we understand how AI perceives the world. There is the question of what the humanities should do in the era of AI; as shown here, one approach is to study how AI 'thinks' and understands concepts, and to apply those findings to humanities problems on the human side. Of course, the pace of AI development is astonishing, and it's uncertain how far such human-centered research can keep up, but these are things I contemplate as I continue my work.


Thank you for reading. I would be happy if you could share this on X (Twitter) or elsewhere.