[Memo] A note to read when the misskey DB gets wiped out

April 24, 2024, 20:34

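Preface

This note summarizes the recovery procedure for when the DB (postgres) running as the backend of the misskey server described in the note referenced above is wiped out.

I have already had to carry out this recovery twice, so I decided to write the steps down here.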

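Recovery procedure

Restoring the VM

If the VM hosting the DB has died along with it, that VM must be restored first.
Rebuild it using the tf files and playbook that were used for the original build.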

ironawi@ironawi-ally:~$ cd terraform/kubernetes/
ironawi@ironawi-ally:~/terraform/kubernetes$ ls
kubernetes_node.tf  modules  output.tf  sg-k8s-master.tf  sg-k8s-worker.tf  sg-misskey...tf  terraform.tfstate  terraform.tfstate.backup
ironawi@ironawi-ally:~/terraform/kubernetes$ terraform plan
...
Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:

  # module.worker_node2.aws_instance.main has been deleted
  - resource "aws_instance" "main" {
        id                                   = "i-0891cba0a7c60c5fe"
      - public_dns                           = "ec2-15-152-119-16.ap-northeast-3.compute.amazonaws.com" -> null
        tags                                 = {
            "Name" = "worker_node2"
        }
        # (33 unchanged attributes hidden)

        # (9 unchanged blocks hidden)
    }


Unless you have made equivalent changes to your configuration, or ignored the relevant attributes using ignore_changes, the following plan may include actions to undo or respond to these changes.

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # module.worker_node2.aws_instance.main will be created
  + resource "aws_instance" "main" {
      + ami                                  = "ami-0c1531991482a24e1"
      + arn                                  = (known after apply)
      + associate_public_ip_address          = (known after apply)
      + availability_zone                    = (known after apply)
      + cpu_core_count                       = (known after apply)
      + cpu_threads_per_core                 = (known after apply)
      + disable_api_stop                     = (known after apply)
      + disable_api_termination              = (known after apply)
      + ebs_optimized                        = (known after apply)
      + get_password_data                    = false
      + host_id                              = (known after apply)
      + host_resource_group_arn              = (known after apply)
      + iam_instance_profile                 = (known after apply)
      + id                                   = (known after apply)
      + instance_initiated_shutdown_behavior = (known after apply)
      + instance_lifecycle                   = (known after apply)
      + instance_state                       = (known after apply)
      + instance_type                        = "t3.small"
      + ipv6_address_count                   = (known after apply)
      + ipv6_addresses                       = (known after apply)
      + key_name                             = "yaiwata-dev-northeast3"
      + monitoring                           = (known after apply)
      + outpost_arn                          = (known after apply)
      + password_data                        = (known after apply)
      + placement_group                      = (known after apply)
      + placement_partition_number           = (known after apply)
      + primary_network_interface_id         = (known after apply)
      + private_dns                          = (known after apply)
      + private_ip                           = (known after apply)
      + public_dns                           = (known after apply)
      + public_ip                            = (known after apply)
      + secondary_private_ips                = (known after apply)
      + security_groups                      = (known after apply)
      + source_dest_check                    = true
      + spot_instance_request_id             = (known after apply)
      + subnet_id                            = "subnet-0988129ff19aad0e4"
      + tags                                 = {
          + "Name" = "worker_node2"
        }
      + tags_all                             = {
          + "Name" = "worker_node2"
        }
      + tenancy                              = (known after apply)
      + user_data                            = (known after apply)
      + user_data_base64                     = (known after apply)
      + user_data_replace_on_change          = false
      + vpc_security_group_ids               = [
          + "sg-03bf3ff0d56c8f475",
          + "sg-0fc354d2e7824d1e6",
        ]

      + instance_market_options {
          + market_type = "spot"

          + spot_options {
              + instance_interruption_behavior = (known after apply)
              + max_price                      = "0.01"
              + spot_instance_type             = (known after apply)
              + valid_until                    = (known after apply)
            }
        }

      + root_block_device {
          + delete_on_termination = true
          + device_name           = (known after apply)
          + encrypted             = (known after apply)
          + iops                  = (known after apply)
          + kms_key_id            = (known after apply)
          + throughput            = (known after apply)
          + volume_id             = (known after apply)
          + volume_size           = 30
          + volume_type           = "gp3"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  ~ worker_node2 = "ec2-15-152-119-16.ap-northeast-3.compute.amazonaws.com" -> (known after apply)

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take exactly these actions if you run "terraform apply" now.

First, check the current state with terraform plan. The output shows that one worker has disappeared and that it will be recreated.
With the state confirmed, recreate the VM with terraform apply.

ironawi@ironawi-ally:~/terraform/kubernetes$ terraform apply
...
Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:

  # module.worker_node2.aws_instance.main has been deleted
  - resource "aws_instance" "main" {
        id                                   = "i-0891cba0a7c60c5fe"
      - public_dns                           = "ec2-15-152-119-16.ap-northeast-3.compute.amazonaws.com" -> null
        tags                                 = {
            "Name" = "worker_node2"
        }
        # (33 unchanged attributes hidden)

        # (9 unchanged blocks hidden)
    }


Unless you have made equivalent changes to your configuration, or ignored the relevant attributes using ignore_changes, the following plan may include actions to undo or respond to these changes.

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # module.worker_node2.aws_instance.main will be created
  + resource "aws_instance" "main" {
      + ami                                  = "ami-0c1531991482a24e1"
      + arn                                  = (known after apply)
      + associate_public_ip_address          = (known after apply)
      + availability_zone                    = (known after apply)
      + cpu_core_count                       = (known after apply)
      + cpu_threads_per_core                 = (known after apply)
      + disable_api_stop                     = (known after apply)
      + disable_api_termination              = (known after apply)
      + ebs_optimized                        = (known after apply)
      + get_password_data                    = false
      + host_id                              = (known after apply)
      + host_resource_group_arn              = (known after apply)
      + iam_instance_profile                 = (known after apply)
      + id                                   = (known after apply)
      + instance_initiated_shutdown_behavior = (known after apply)
      + instance_lifecycle                   = (known after apply)
      + instance_state                       = (known after apply)
      + instance_type                        = "t3.small"
      + ipv6_address_count                   = (known after apply)
      + ipv6_addresses                       = (known after apply)
      + key_name                             = "yaiwata-dev-northeast3"
      + monitoring                           = (known after apply)
      + outpost_arn                          = (known after apply)
      + password_data                        = (known after apply)
      + placement_group                      = (known after apply)
      + placement_partition_number           = (known after apply)
      + primary_network_interface_id         = (known after apply)
      + private_dns                          = (known after apply)
      + private_ip                           = (known after apply)
      + public_dns                           = (known after apply)
      + public_ip                            = (known after apply)
      + secondary_private_ips                = (known after apply)
      + security_groups                      = (known after apply)
      + source_dest_check                    = true
      + spot_instance_request_id             = (known after apply)
      + subnet_id                            = "subnet-0988129ff19aad0e4"
      + tags                                 = {
          + "Name" = "worker_node2"
        }
      + tags_all                             = {
          + "Name" = "worker_node2"
        }
      + tenancy                              = (known after apply)
      + user_data                            = (known after apply)
      + user_data_base64                     = (known after apply)
      + user_data_replace_on_change          = false
      + vpc_security_group_ids               = [
          + "sg-03bf3ff0d56c8f475",
          + "sg-0fc354d2e7824d1e6",
        ]

      + instance_market_options {
          + market_type = "spot"

          + spot_options {
              + instance_interruption_behavior = (known after apply)
              + max_price                      = "0.01"
              + spot_instance_type             = (known after apply)
              + valid_until                    = (known after apply)
            }
        }

      + root_block_device {
          + delete_on_termination = true
          + device_name           = (known after apply)
          + encrypted             = (known after apply)
          + iops                  = (known after apply)
          + kms_key_id            = (known after apply)
          + throughput            = (known after apply)
          + volume_id             = (known after apply)
          + volume_size           = 30
          + volume_type           = "gp3"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  ~ worker_node2 = "ec2-15-152-119-16.ap-northeast-3.compute.amazonaws.com" -> (known after apply)

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.worker_node2.aws_instance.main: Creating...
module.worker_node2.aws_instance.main: Still creating... [10s elapsed]
module.worker_node2.aws_instance.main: Provisioning with 'local-exec'...
module.worker_node2.aws_instance.main (local-exec): Executing: ["/bin/sh" "-c" "modules/ec2/../scripts/check_ssh_connection.sh <host name>"]
module.worker_node2.aws_instance.main (local-exec): checking ssh connection...
...
module.worker_node2.aws_instance.main (local-exec): ssh connection established!
module.worker_node2.aws_instance.main: Provisioning with 'local-exec'...
module.worker_node2.aws_instance.main (local-exec): Executing: ["/bin/sh" "-c" "ansible-playbook -i <host name>, modules/ec2/../ansible/setup_k8s.yaml"]

module.worker_node2.aws_instance.main (local-exec): PLAY [all] *********************************************************************
...
module.worker_node2.aws_instance.main: Creation complete after 1m19s [id=<instance id>]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

master_node = "<host name>"
worker_node1 = "<host name>"
worker_node2 = "<host name>"
worker_node3 = "<host name>"

If the build completes without problems, the command finishes with output like the above.
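The plan output above points out that the -out option was not used. As a minimal sketch, the plan can be saved to a file and that exact plan applied afterwards. The function names and the file name recovery.tfplan are hypothetical and not part of the original setup.

```shell
# Sketch: save the plan, review it, then apply exactly what was reviewed.
# "recovery.tfplan" is a hypothetical file name chosen for illustration.

save_plan() {
  plan_file="${1:-recovery.tfplan}"
  # Write the execution plan to a file instead of only printing it.
  terraform plan -out="$plan_file"
}

apply_saved_plan() {
  plan_file="${1:-recovery.tfplan}"
  # Applying a saved plan file runs without the interactive "yes" prompt,
  # so review the output of save_plan before calling this.
  terraform apply "$plan_file"
}
```

Run save_plan first, confirm that only the lost worker is going to be created, and then run apply_saved_plan.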

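Cleaning up leftover resources and preparing for DB recovery

Because the VM died abruptly, the k8s resources that were tied to it are still present.
Log in to the control plane node over ssh and clean them up.

First, stop the misskey web server, which accesses the DB.
Edit the deployment, set replicas to 0, and save.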

kubectl edit deploy -n misskey web-deployment
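As a non-interactive alternative to kubectl edit, the same change can probably be made with kubectl scale. The sketch below also records the current replica count first so that it can be restored later. The function name stop_web and the file name web-replicas.txt are hypothetical.

```shell
# Sketch: stop the misskey web server without opening an editor.
# The deployment name and namespace are taken from the command above;
# "web-replicas.txt" is a hypothetical file used to remember the old count.

stop_web() {
  replicas_file="${1:-web-replicas.txt}"
  # Save the current replica count before changing it.
  kubectl get deployment web-deployment -n misskey \
    -o jsonpath='{.spec.replicas}' > "$replicas_file" || return 1
  # Scale the deployment down to zero Pods.
  kubectl scale deployment web-deployment -n misskey --replicas=0
}
```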

After confirming that the web server Pods are gone, drain the node that was tied to the dead VM and then delete it.

kubectl get po -n misskey
kubectl get node
kubectl drain --ignore-daemonsets --force <node name>
kubectl delete node <node name>
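To find the node name to pass to the drain and delete commands, the output of kubectl get node can be filtered for nodes whose status is NotReady. This is a small sketch that assumes the dead VM's node is reported as NotReady; the function name is hypothetical.

```shell
# Sketch: print the names of nodes whose STATUS column starts with NotReady.
# Assumes the usual "kubectl get node" column order (NAME, STATUS, ...).

not_ready_nodes() {
  kubectl get node --no-headers | awk '$2 ~ /^NotReady/ { print $1 }'
}
```

The printed names are candidates only; check them against the VM that was actually lost before draining anything.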

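Joining the newly built k8s worker node to the cluster

Following the "ワーカーノードの作成" (creating the worker node) section of the page referenced above, carry out the steps on the worker node up to and including the kubelet restart.

Following the "トークンを作成" (creating a token) section of the same page, reissue a cluster join token on the control plane node.
Running the displayed kubeadm join command on the worker node adds it to the k8s cluster.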
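One way to reissue the token and get the join command in a single step is kubeadm token create with --print-join-command. The sketch below wraps that, plus a wait until the new node reports Ready. The function names and the 300s timeout are assumptions, and the referenced page may describe a different procedure.

```shell
# Sketch: run on the control plane node.

print_join_command() {
  # Creates a new bootstrap token and prints the full "kubeadm join ..." line.
  kubeadm token create --print-join-command
}

wait_node_ready() {
  node_name="$1"
  # Blocks until the node's Ready condition is true, or the timeout expires.
  kubectl wait --for=condition=Ready "node/$node_name" --timeout=300s
}
```

Run the printed kubeadm join line on the worker node with root privileges, then call wait_node_ready with the new node's name on the control plane node.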

Starting postgres and restoring from backup

Apply the DB manifest kept on the control plane node to start the DB again.

kubectl apply -f db.yaml 
kubectl get po -n misskey
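Instead of checking the Pod list repeatedly by hand, the wait can be scripted. This sketch assumes the Pod is named db, as in the later commands in this note, and that pg_isready is available inside the container. The function name and the default timeout are hypothetical.

```shell
# Sketch: wait for the db Pod, then check that postgres accepts connections.

wait_db_ready() {
  timeout="${1:-300s}"
  # Wait until the Pod's Ready condition becomes true.
  kubectl wait --for=condition=Ready pod/db -n misskey --timeout="$timeout" || return 1
  # Ask postgres itself whether it is accepting connections.
  kubectl exec -n misskey db -- pg_isready -U misskey-user
}
```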

Once the db Pod is confirmed to be running, copy the backup file into the Pod.

kubectl cp /k8s/misskey/backup/<backup file>.tar.gz misskey/db:/

Enter the Pod with kubectl exec, extract the tar.gz that was copied in, and restore the DB from the backup file it contains.

kubectl exec -it -n misskey db -- /bin/bash
tar zxf <backup file>.tar.gz
psql -U misskey-user misskey < tmp/backup/dump.sql
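The same restore can also be run from the control plane node without an interactive shell. The sketch below assumes the archive has already been copied to / in the Pod and that it unpacks to tmp/backup/dump.sql, as in the commands above. The function name is hypothetical, and ON_ERROR_STOP is added here so that psql stops at the first failing statement.

```shell
# Sketch: extract the archive inside the Pod and load the dump with psql.

restore_db() {
  archive="$1"   # file name of the archive already copied to / in the Pod
  # Extract relative to / so the dump ends up at /tmp/backup/dump.sql.
  kubectl exec -n misskey db -- tar zxf "/$archive" -C / || return 1
  # Load the dump; abort on the first SQL error instead of continuing.
  kubectl exec -n misskey db -- \
    psql -v ON_ERROR_STOP=1 -U misskey-user -d misskey -f /tmp/backup/dump.sql
}
```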

Starting redis and restoring from backup

If redis goes down, put the backed-up dump.rdb back in /k8s/misskey/redis/ and restart redis.
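A small sketch of the file placement step, with a check that the backup actually exists and is not empty. The function name is hypothetical. Redis normally loads dump.rdb only at startup and may rewrite it on shutdown, so it is probably safest to replace the file while redis is stopped.

```shell
# Sketch: copy a backed-up RDB file into the redis data directory.
# The default destination is the directory named in this note.

place_redis_dump() {
  src="$1"
  dest_dir="${2:-/k8s/misskey/redis}"
  # Refuse to continue if the backup is missing or empty.
  if [ ! -s "$src" ]; then
    echo "backup not found or empty: $src" >&2
    return 1
  fi
  # Redis looks for the file under the name dump.rdb by default.
  cp -p "$src" "$dest_dir/dump.rdb"
}
```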

Restarting misskey

Once the DB has been restored, bring the misskey web server back up.
Setting the deployment's replicas back to its original value is enough.
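A sketch of the restart using kubectl scale, followed by kubectl rollout status to wait until the Pods are up. The function name is hypothetical, and the default of 1 replica is an assumption; pass the value the deployment had before it was stopped.

```shell
# Sketch: bring the misskey web server back and wait for the rollout.

start_web() {
  replicas="${1:-1}"   # assumption: 1 replica unless told otherwise
  kubectl scale deployment web-deployment -n misskey --replicas="$replicas" || return 1
  # Blocks until the deployment reports that all replicas are available.
  kubectl rollout status deployment/web-deployment -n misskey
}
```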
