共起頻度行列を使って静的なブログ記事のタグ同士の関連性を計算する方法

背景

ブログは以下のような構成でタグの紐付けをしている。

ブログ記事になる Markdown ファイル

Front Matter でタグを指定

---
title: '記事タイトル'
tags:
  - タグ1
  - タグ2
---

Next.js でビルドする前に全記事（Markdown ファイル）からタグ一覧を生成
下記のフォーマットの JSON を生成して記事とタグの紐付けを管理

{
  "タグ名": ["記事スラッグ", "記事スラッグ", "記事スラッグ"],
  "タグ名": ["記事スラッグ", "記事スラッグ", "記事スラッグ"],
  "タグ名": ["記事スラッグ", "記事スラッグ", "記事スラッグ"],
  ...
}

関連付け方法

記事とタグの関連性を計算するのに共起頻度行列を使うと良さそうだったので、共起頻度行列を使ってみる。

共起頻度行列は、自然言語処理の分野で文書や単語の分析によく用いられているようである。

共起頻度行列とは

共起頻度行列とは、複数の事象について、同時に出現する回数を表にまとめたものである。

共起頻度行列 (たんに共起行列)とは,コーパス中の単語の共起頻度を要素とする行列である.
[引用:PDF] 対数共起頻度を用いた四項類推: word2vec と PMI との比較

たとえば、ある文書集合が与えられた場合、文書中に出現する単語を抽出して、単語の出現頻度を数える。そして、同じ文書内に共起こする単語の組み合わせの回数を表にまとめたものが、共起頻度行列である。

たとえば、次のような文書がある場合、以下のようになる。

文書1：りんごとみかんが好きだ
文書2：みかんとバナナが好きだ
文書3：バナナとりんごが嫌いだ

この文書から抽出した単語は、以下の 3 つである。

['りんご', 'みかん', 'バナナ'];

これらの単語を表にまとめたものが次のようになる（共起頻度行列）。

      りんご  みかん  バナナ
    +---------------------
りんご |   2     1      1
みかん |   1     2      1
バナナ |   1     1      2

行と列に出現した単語が、同じ文書に出現した回数が表されている。たとえば、「りんご」と「みかん」は 1 つの文書にともに 2 回出現しているため、行「りんご」、列「みかん」のセルには 2 が入っている。

実装

// 記事に紐付いたタグを格納したオブジェクト（簡略化したもの）
type PostsProps = {
  slug: string;
  tags: string[];
};

const posts: PostsProps = [
  {
    slug: '202303282336',
    tags: ['HTML'],
  },
  {
    slug: '202303191657',
    tags: ['設計', 'JavaScript', 'TypeScript', 'ESLint'],
  },
];

// 全てのタグ一覧（簡略化したもの）
type TagsListProps = Record<string, string[]>;

const tags: TagsListProps = {
  HTML: ['20131024100432', '20131021115803', '20131017103737', '20130110131343', '20121119162408'],
  設計: ['20121205112442', '20121126130658'],
  ブラウザ: ['20130821162938', '20130302170449'],
};

// 共起頻度行列
const coOccurrenceMatrix = {};

posts.forEach((article) => {
  const tags = article.tags;
  tags.forEach((tag) => {
    if (!coOccurrenceMatrix[tag]) {
      coOccurrenceMatrix[tag] = {};
    }
    tags.forEach((otherTag) => {
      if (tag !== otherTag) {
        if (!coOccurrenceMatrix[tag][otherTag]) {
          coOccurrenceMatrix[tag][otherTag] = 1;
        } else {
          coOccurrenceMatrix[tag][otherTag]++;
        }
      }
    });
  });
});

// タグ関連性を計算する
const tagRelations = {};

for (const tag in coOccurrenceMatrix) {
  for (const otherTag in coOccurrenceMatrix[tag]) {
    const count = coOccurrenceMatrix[tag][otherTag];
    if (!tags[tag]) {
      continue;
    }
    const tagArticleCount = tags[tag].length;
    if (!tags[otherTag]) {
      continue;
    }
    const otherTagArticleCount = tags[otherTag].length;
    const relation = count / Math.sqrt(tagArticleCount * otherTagArticleCount);
    if (!tagRelations[tag]) {
      tagRelations[tag] = {};
    }
    tagRelations[tag][otherTag] = relation;
  }
}

// 数値の高い順に並び替え、数値を切り上げ
const sortedTags = {};

for (const tag in tagRelations) {
  const sortedRelatedTags = {};
  const relatedTagEntries = Object.entries(tagRelations[tag]);
  relatedTagEntries.sort((a: [string, number], b: [string, number]) => b[1] - a[1]);
  relatedTagEntries.forEach(([key, value]: [string, number]) => {
    sortedRelatedTags[key] = Number(value.toFixed(4));
  });
  sortedTags[tag] = sortedRelatedTags;
}

console.log(sortedTags);

各記事に紐付いたタグどうしの関連性を調べるための共起頻度行列を作成
各タグどうしの関連性を計算して、tagRelationsオブジェクトに格納
タグ同士の関連性は、各タグが出現する記事数の平方根で割り算出する

結果

{
  "HTML": {
    "アクセシビリティ": 0.3111,
    "SEO": 0.1606,
    "CSS": 0.1589,
    "Internet Explorer": 0.127,
    "JavaScript": 0.1261,
    "Advent Calendar": 0.0803,
    "Google": 0.0679,
    "Firefox": 0.0449
  },
  "設計": {
    "ITCSS": 0.6547,
    "CSS": 0.2229,
    "ESLint": 0.2182,
    "Vuex": 0.2182,
    "TypeScript": 0.189,
    "Vue.js": 0.114,
    "JavaScript": 0.0442
  }
}

おわり

これらの指標を使って、記事に「関連タグ」を表示させることができるようになった。