Resilience & Fault Tolerance

This section uses oc commands. If you haven’t already, log in from the Terminal tab — see OpenShift CLI Access.

CertChain uses SmartBFT (Byzantine Fault Tolerant) consensus, which provides stronger guarantees than traditional Raft — tolerating not just crash failures but also malicious nodes.

BFT Consensus Theory

The Formula: 3f + 1

BFT consensus requires 3f + 1 total nodes to tolerate f Byzantine faults:

Total Orderers   Faults Tolerated   Fault Type
4 (this demo)    1                  Byzantine (crash or malicious)
7                2                  Byzantine
10               3                  Byzantine

With 4 orderers, the network can tolerate 1 orderer being offline, unresponsive, or actively sending incorrect data — and still reach consensus.
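The 3f + 1 arithmetic is easy to sanity-check locally. This small sketch (the function names are illustrative, not part of any toolchain) computes the tolerated faults and the quorum size for each orderer count in the table, assuming n = 3f + 1:

```shell
# Illustrative helpers: for n orderers, f = floor((n - 1) / 3) Byzantine
# faults are tolerated, and the quorum is n - f (equal to 2f + 1 when n = 3f + 1).
bft_faults() { echo $(( ($1 - 1) / 3 )); }
bft_quorum() { echo $(( $1 - ($1 - 1) / 3 )); }

for n in 4 7 10; do
  echo "orderers=$n tolerated_faults=$(bft_faults "$n") quorum=$(bft_quorum "$n")"
done
# orderers=4 tolerated_faults=1 quorum=3
# orderers=7 tolerated_faults=2 quorum=5
# orderers=10 tolerated_faults=3 quorum=7
```

With 4 orderers the quorum is 3, which is exactly why the kill-an-orderer demo below keeps committing transactions with one orderer scaled down.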

Why BFT Over Raft?

Raft (used in Fabric 2.x) tolerates crash faults only. If a Raft node starts sending corrupted data, the network has no protection. BFT handles both:

  • Crash faults — Node stops responding (same as Raft)

  • Byzantine faults — Node sends incorrect data, delays messages, or acts maliciously

In a multi-organization consortium where trust is limited, BFT is the appropriate consensus model.
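The node-count cost of that extra protection is simple to quantify: crash fault tolerance (Raft) needs 2f + 1 nodes, while Byzantine fault tolerance needs 3f + 1. A quick local comparison (helper names are illustrative):

```shell
# Raft tolerates f crash faults with 2f + 1 nodes;
# SmartBFT tolerates f Byzantine faults with 3f + 1 nodes.
raft_nodes() { echo $(( 2 * $1 + 1 )); }
bft_nodes()  { echo $(( 3 * $1 + 1 )); }

for f in 1 2 3; do
  echo "f=$f raft=$(raft_nodes "$f") bft=$(bft_nodes "$f")"
done
# f=1 raft=3 bft=4
# f=2 raft=5 bft=7
# f=3 raft=7 bft=10
```

BFT costs f extra nodes over Raft for the same f; those extra nodes buy protection against malicious behavior, not just crashes.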

Demo: Kill an Orderer

This demo proves the network continues operating with one orderer down.

Step 1: Verify Normal Operation

First, confirm transactions work:

# Get a token
TOKEN=$(curl -sk -X POST \
  "https:///realms/techpulse/protocol/openid-connect/token" \
  -d "grant_type=password" \
  -d "client_id=course-manager-ui" \
  -d "username=admin@techpulse.demo" \
  -d "password=admin" | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

# Generate a unique cert ID using timestamp
CERT_ID="RESIL-$(date +%s)"

# Issue a certificate
curl -sk -X POST \
  "https:///api/v1/certificates" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"certID\": \"$CERT_ID\",
    \"studentID\": \"student01@techpulse.demo\",
    \"studentName\": \"Resilience Test\",
    \"courseID\": \"TEST-101\",
    \"courseName\": \"BFT Testing\",
    \"issueDate\": \"2026-03-14\",
    \"expiryDate\": \"2027-03-14\"
  }" | python3 -m json.tool
echo "Issued certificate: $CERT_ID"
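If the token request fails, the python3 extraction above raises a KeyError on the error response. A local way to see both behaviors against canned, Keycloak-style sample strings (no cluster calls; the sample payloads are assumptions for illustration):

```shell
# Canned token-endpoint responses; the one-liner above pulls access_token.
GOOD='{"access_token":"eyJ-sample","expires_in":300,"token_type":"Bearer"}'
BAD='{"error":"invalid_grant","error_description":"Invalid user credentials"}'

TOKEN=$(echo "$GOOD" | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
echo "token: $TOKEN"

# A more defensive variant that reports the error field instead of crashing:
echo "$BAD" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('access_token') or 'login failed: ' + d.get('error_description', 'unknown'))"
```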

Step 2: Pause ArgoCD and Scale Down an Orderer

ArgoCD auto-sync would restore the orderer automatically — pause it first:

oc patch application certchain-dataforge -n openshift-gitops --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":null}}}'

Now scale down the DataForge orderer:

oc scale deployment orderer -n certchain-dataforge --replicas=0

Step 3: Verify Transactions Still Work

# New unique cert ID — proves transactions work with one orderer down
CERT_ID2="RESIL-BFT-$(date +%s)"

curl -sk -X POST \
  "https:///api/v1/certificates" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"certID\": \"$CERT_ID2\",
    \"studentID\": \"student01@techpulse.demo\",
    \"studentName\": \"Resilience Test 2\",
    \"courseID\": \"TEST-101\",
    \"courseName\": \"BFT Testing\",
    \"issueDate\": \"2026-03-14\",
    \"expiryDate\": \"2027-03-14\"
  }" | python3 -m json.tool
echo "Issued certificate with orderer down: $CERT_ID2"

The transaction should succeed — 3 out of 4 orderers is sufficient for BFT consensus.

Step 4: Restore the Orderer and Watch Recovery

oc scale deployment orderer -n certchain-dataforge --replicas=1

Watch the orderer logs — you’ll see it rejoin the BFT cluster and sync missed blocks:

# Wait for the pod to start, then watch SmartBFT recovery
sleep 5
oc logs deployment/orderer -n certchain-dataforge 2>/dev/null | grep -i "smartbft\|sequence\|decision\|startView\|commit"

Look for these key log messages:

  • Last proposal with sequence N has been safely committed — orderer replayed its WAL

  • Starting view with number 0, sequence N, and decisions M — orderer rejoined consensus

  • Changing to follower role, current leader: X — cluster accepted the returning node

Re-enable ArgoCD:

oc patch application certchain-dataforge -n openshift-gitops --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'

Step 5: Verify the recovered orderer has the new block

# The certificate we issued while the orderer was down should be verifiable
curl -sk "https:///api/v1/verify/$CERT_ID2" | python3 -m json.tool

If status is VALID, the orderer caught up with the block that was committed while it was offline.

Open the Grafana dashboard during this test — you’ll see the BFT Committed Block Number panel show the restored orderer catching up to match the others.

Demo: Peer Recovery

Each organization’s peer maintains a full copy of the ledger and the CouchDB state database. This demo proves that when a peer goes down, the blockchain keeps working through other orgs, and the recovered peer syncs missed blocks automatically via gossip.

Step 1: Record block height and take down the peer

Record the current ledger height on the TechPulse peer, then take it offline:

# Record block height BEFORE the outage
echo "=== TechPulse block height BEFORE peer down ==="
oc exec deployment/peer0 -n certchain-techpulse -- \
  peer channel getinfo -c certchannel 2>/dev/null | grep -o '{.*}' | python3 -m json.tool

# Pause ArgoCD auto-sync (it would restore the peer automatically)
oc patch application certchain-techpulse -n openshift-gitops --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":null}}}'

# Kill the TechPulse peer
oc scale deployment peer0 -n certchain-techpulse --replicas=0
echo ""
echo "TechPulse peer is DOWN"

Note the height value — after recovery, it should be higher.
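To compare heights programmatically rather than by eye, the same grep + python3 pipeline can pull out just the height field. A local sketch against a canned getinfo line (in the real check, the oc exec output replaces the sample string):

```shell
# Sample output shaped like `peer channel getinfo -c certchannel`
SAMPLE='Blockchain info: {"height":17,"currentBlockHash":"dGVzdA==","previousBlockHash":"cHJldg=="}'

BEFORE_HEIGHT=$(echo "$SAMPLE" | grep -o '{.*}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['height'])")
echo "height before outage: $BEFORE_HEIGHT"
```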

Step 2: Issue a certificate through DataForge (while TechPulse is down)

The blockchain doesn’t stop — other orgs can still transact:

# Get a DataForge token
DF_TOKEN=$(curl -sk -X POST \
  "https:///realms/dataforge/protocol/openid-connect/token" \
  -d "grant_type=password" \
  -d "client_id=course-manager-ui" \
  -d "username=admin@dataforge.demo" \
  -d "password=admin" | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

# Issue a certificate through DataForge while TechPulse peer is offline
PEER_TEST_ID="PEER-RECOVERY-$(date +%s)"
curl -sk -X POST \
  "https:///api/v1/certificates" \
  -H "Authorization: Bearer $DF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"certID\": \"$PEER_TEST_ID\",
    \"studentID\": \"student01@dataforge.demo\",
    \"studentName\": \"Peer Recovery Test\",
    \"courseID\": \"DF-101\",
    \"courseName\": \"DataForge Testing\",
    \"issueDate\": \"2026-03-14\",
    \"expiryDate\": \"2027-03-14\"
  }" | python3 -m json.tool
echo "Issued certificate via DataForge: $PEER_TEST_ID"

This transaction was endorsed by the DataForge peer and committed to the blockchain — TechPulse’s peer missed this block entirely.

Step 3: Bring TechPulse peer back and watch it sync

oc scale deployment peer0 -n certchain-techpulse --replicas=1
echo "Waiting for peer to start..."
oc rollout status deployment peer0 -n certchain-techpulse --timeout=60s

Now wait for the peer to sync and display proof — a polling loop checks for the Committed block log entry that confirms the peer caught up:

# Poll until the peer commits the missed block(s)
echo "Waiting for peer to sync missed blocks..."
for i in $(seq 1 12); do
  COMMITTED=$(oc logs deployment/peer0 -n certchain-techpulse 2>/dev/null \
    | grep "Committed block")
  if [ -n "$COMMITTED" ]; then
    echo ""
    echo "=== Peer block sync proof ==="
    oc logs deployment/peer0 -n certchain-techpulse 2>/dev/null \
      | grep -E "Committed block|Received block|Joining gossip"
    break
  fi
  sleep 5
done
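The loop above is a standard poll-until-ready pattern. A reusable version looks like this (the retry function is a hypothetical helper, not part of oc or Fabric):

```shell
# retry ATTEMPTS DELAY CMD... : run CMD up to ATTEMPTS times, sleeping DELAY
# seconds between tries; return success as soon as CMD succeeds.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Usage shape for the sync check above (cluster command shown as a comment):
# retry 12 5 sh -c 'oc logs deployment/peer0 -n certchain-techpulse 2>/dev/null | grep -q "Committed block"'
retry 3 0 true && echo "command eventually succeeded"
```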

echo ""
echo "=== Block height AFTER recovery ==="
oc exec deployment/peer0 -n certchain-techpulse -- \
  peer channel getinfo -c certchannel 2>/dev/null | grep -o '{.*}' | python3 -m json.tool

You should see output like:

Joining gossip network of channel certchannel with 3 organizations   (1)
Received block [N] from buffer channel=certchannel                   (2)
Committed block [N] with 1 transaction(s) in 246ms                   (3)

(1) Peer rejoined the gossip network
(2) Peer received the missed block from the ordering service
(3) Peer validated and committed the block to its ledger

Compare the height to Step 1 — it should be higher (the peer caught up).

Step 4: Prove the peer has the new data

The certificate issued through DataForge should now be queryable through TechPulse’s peer:

echo "=== Verifying cert issued during TechPulse outage ==="
curl -sk "https:///api/v1/verify/$PEER_TEST_ID" | python3 -m json.tool

If status is VALID, the TechPulse peer successfully synced the block that was committed while it was offline. The certificate issued through DataForge is now visible across the entire network.

Re-enable ArgoCD:

oc patch application certchain-techpulse -n openshift-gitops --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'

Demo: CouchDB State Rebuild

CouchDB is the state database — a queryable view of the latest ledger state. The blockchain (ledger) is the source of truth; CouchDB is just an index built from it.

This demo proves that principle by completely destroying the state databases and using Fabric’s peer node rebuild-dbs command to reconstruct them from the immutable ledger.

Step 1: Record current state

Record the CouchDB document count before we destroy it:

echo "=== BEFORE: CouchDB certcontract database ==="
oc exec deployment/couchdb -n certchain-techpulse -- \
  curl -s -u admin:adminpw http://localhost:5984/certchannel_certcontract 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'doc_count: {d[\"doc_count\"]}')"

echo ""
echo "=== BEFORE: Verify a known certificate ==="
curl -sk "https:///api/v1/verify/TP-FSWD-001" | python3 -m json.tool

Note the doc_count — after the rebuild it should match exactly.

Step 2: Pause ArgoCD and scale down the peer

oc patch application certchain-techpulse -n openshift-gitops --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":null}}}'

oc scale deployment peer0 -n certchain-techpulse --replicas=0
echo "Peer is DOWN"

Step 3: Delete the CouchDB state database

With the peer stopped, we simulate catastrophic data corruption by dropping the chaincode state database entirely:

# Delete the certcontract state database from CouchDB
oc exec deployment/couchdb -n certchain-techpulse -- \
  curl -s -u admin:adminpw -X DELETE http://localhost:5984/certchannel_certcontract
echo ""

# Verify it's gone
echo "=== State database deleted ==="
oc exec deployment/couchdb -n certchain-techpulse -- \
  curl -s -u admin:adminpw http://localhost:5984/certchannel_certcontract 2>/dev/null

You should see {"error":"not_found","reason":"Database does not exist."}.
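For scripted checks, the error field can be extracted instead of eyeballing JSON. A local sketch with the expected response as a canned string:

```shell
# The response CouchDB returns for a missing database:
RESP='{"error":"not_found","reason":"Database does not exist."}'
ERR=$(echo "$RESP" | python3 -c "import sys,json; print(json.load(sys.stdin).get('error',''))")

if [ "$ERR" = "not_found" ]; then
  echo "confirmed: state database is gone"
fi
```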

Step 4: Run rebuild-dbs to mark state for reconstruction

Fabric provides peer node rebuild-dbs — it marks all state databases for reconstruction. On next startup, the peer will replay every block from the ledger into fresh CouchDB databases.

The command must run while the peer is stopped. We use oc debug to create a temporary pod with the peer’s volumes mounted:

oc debug deployment/peer0 -n certchain-techpulse -- peer node rebuild-dbs

You should see:

  • Dropping CouchDB application databases — clears all remaining state DBs

  • Dropping all contents in StateLevelDB — clears internal state indexes

  • DeleteBlockStoreIndex — clears block index (will be rebuilt too)

Step 5: Start the peer and watch it rebuild

oc scale deployment peer0 -n certchain-techpulse --replicas=1
echo "Waiting for peer to start..."
oc rollout status deployment peer0 -n certchain-techpulse --timeout=90s

Watch the rebuild — the peer recreates every CouchDB database and replays blocks:

sleep 10
echo "=== Peer rebuild activity ==="
oc logs deployment/peer0 -n certchain-techpulse 2>/dev/null \
  | grep -E "Created state database|syncIndex|Committed block|Opened ledger"

You should see:

  • Finished building index. Last block indexed [N] — block index rebuilt from ledger files

  • Created state database certchannel_certcontract — CouchDB database recreated

  • Created state database certchannel_ — system state databases recreated

  • Opened ledger with id = certchannel — peer is back online

Step 6: Verify full state recovery

echo "=== AFTER: CouchDB certcontract database ==="
oc exec deployment/couchdb -n certchain-techpulse -- \
  curl -s -u admin:adminpw http://localhost:5984/certchannel_certcontract 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'doc_count: {d[\"doc_count\"]}')"

echo ""
echo "=== AFTER: Verify the same certificate ==="
curl -sk "https:///api/v1/verify/TP-FSWD-001" | python3 -m json.tool

The doc_count should match the BEFORE value exactly, and the certificate should show status: "VALID" — the peer rebuilt the entire state from the blockchain ledger.

Re-enable ArgoCD:

oc patch application certchain-techpulse -n openshift-gitops --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'

This is a key architectural principle: the blockchain ledger is the immutable source of truth. CouchDB (or any state database) is a derived view that can be reconstructed at any time by replaying the ledger using peer node rebuild-dbs. This is why Fabric doesn’t need traditional database backups — the ledger is the backup.
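The replay principle can be shown in miniature: a toy ledger of ordered writes, folded in block order, yields the latest state — the same way rebuild-dbs reconstructs CouchDB (toy data only, not Fabric’s actual block format):

```shell
python3 - <<'EOF'
# Toy "ledger": an ordered, append-only log of (key, value) writes.
ledger = [
    ("cert1", "ISSUED"),
    ("cert2", "ISSUED"),
    ("cert1", "REVOKED"),  # a later block supersedes the earlier write
]

# Rebuilding the "state database" = replaying every write in block order.
state = {}
for key, value in ledger:
    state[key] = value

print(state)  # {'cert1': 'REVOKED', 'cert2': 'ISSUED'}
EOF
```

Any number of replays of the same ledger produces the same state, which is why the doc_count and certificate status match their pre-deletion values exactly.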

Production Considerations

Consideration     Recommendation
Orderer count     Use 7 orderers (3f + 1 with f = 2) in production for better fault tolerance
Peer redundancy   Run 2+ peers per org with RWX storage for endorsement availability
Storage           Use ReadWriteMany (RWX) storage for peer PVCs to enable horizontal scaling
Backup            Snapshot CouchDB and ledger PVCs regularly
Cross-region      Distribute orderers across availability zones for geographic resilience
Monitoring        Set alerts on consensus view changes, endorsement failures, and peer gossip metrics