How To Identify OSD(s) affected by PG Dup Bug

Follow-up to “osd(s) with unlimited ram growth”

You can check whether any OSDs in a Ceph cluster are affected by the “PG Dup Bug” by running the following command:

ceph tell osd.\* perf dump |grep 'osd_pglog\|^osd\.[0-9]'

This prints a list of all OSDs in the cluster, each with two counters:

  1. osd_pglog_bytes
  2. osd_pglog_items
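On a cluster with many OSDs the raw output gets long, so it can help to condense it to one line per OSD and put the largest counters first. The following is a minimal sketch, not part of Ceph: `summarize_pglog` is a hypothetical helper name, and it assumes the output layout implied by the grep pattern above (a line starting with `osd.N:` followed by that OSD's JSON counters, on the same line or on the lines after it).

```shell
# summarize_pglog (hypothetical helper): reads the filtered perf dump on stdin
# and prints "osd bytes items" per OSD, largest osd_pglog_items first.
#
# Usage (needs a live cluster):
#   ceph tell osd.\* perf dump | grep 'osd_pglog\|^osd\.[0-9]' | summarize_pglog
summarize_pglog() {
  awk '
    # Return the integer that follows "key": on the current line.
    function num(key,   s) {
      s = $0
      sub(".*\"" key "\"[ \t]*:[ \t]*", "", s)
      sub(/[^0-9].*/, "", s)
      return s
    }
    # Remember which OSD the following counters belong to.
    /^osd\.[0-9]+/      { osd = $1; sub(/:.*/, "", osd) }
    /"osd_pglog_bytes"/ { bytes[osd] = num("osd_pglog_bytes") }
    /"osd_pglog_items"/ { items[osd] = num("osd_pglog_items") }
    END { for (o in items) print o, bytes[o], items[o] }
  ' | sort -k3,3nr
}
```

The OSDs at the top of the sorted list are the ones worth a closer look.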

The “osd_pglog_items” counter is the sum of “normal” log entries, dup entries and some other things. Given that osd_target_pg_log_entries_per_osd is 300,000 by default, we may assume that about 300,000 items are “normal” pg log entries; if the “osd_pglog_items” counter is much higher than this, it is most likely due to dups. Example:

osd.32: { "osd_pglog_bytes": 1925908608, "osd_pglog_items": 17418324 }

17,418,324 (osd_pglog_items) - 300,000 (expected “normal” entries) = 17,118,324, so probably about 17 million PG dups
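The same subtraction can be applied to every OSD at once. The sketch below is an illustration under stated assumptions, not an official tool: `flag_dup_suspects` is a hypothetical helper name, its default threshold of 300000 assumes that osd_target_pg_log_entries_per_osd is still at its default, and the "excess" column is only a rough estimate because the counter also includes things other than dups.

```shell
# flag_dup_suspects (hypothetical helper): reads the filtered perf dump on stdin
# and prints "osd items excess" for each OSD whose osd_pglog_items is above the
# threshold (first argument, default 300000).
#
# Usage (needs a live cluster):
#   ceph tell osd.\* perf dump | grep 'osd_pglog\|^osd\.[0-9]' | flag_dup_suspects
flag_dup_suspects() {
  awk -v t="${1:-300000}" '
    # Remember which OSD the following counters belong to.
    /^osd\.[0-9]+/ { osd = $1; sub(/:.*/, "", osd) }
    /"osd_pglog_items"/ {
      items = $0
      sub(/.*"osd_pglog_items"[ \t]*:[ \t]*/, "", items)  # drop everything before the value
      sub(/[^0-9].*/, "", items)                          # drop everything after it
      if (items + 0 > t) print osd, items, items - t      # excess = rough dup estimate
    }
  '
}
```

Fed the example line for osd.32, this prints `osd.32 17418324 17118324`. An OSD that stays below the threshold produces no output.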

Running a manual check against this OSD with the commands from “osd(s) with unlimited ram growth” revealed 1 PG with 17,090,093 entries. So this is a quick and easy way to identify problematic OSD(s) without having to stop all OSDs and run the commands manually.

Sources:

https://github.com/ceph/ceph/blob/master/src/common/options/global.yaml.in#L2951
