go: keepalive, exponential backoff, chain_id metadata, durability guarantees

Three related fixes that turn the go template into a client that
survives the full matrix of server restart, client restart, network
blip, half-open TCP, and long outages (hours → months) — without the
user writing a line of reconnect logic in process.go.

1. gRPC keepalive: Time=10s, Timeout=3s, PermitWithoutStream=true.
   Half-open TCP (silent server restart, resumed laptop, NAT drop)
   is detected within ~13s. Previously the OS TCP keepalive took
   ~2h to notice, leaving the client as a ghost stream while prime
   logged "no active gRPC connection" for every skipped transaction.

2. Exponential backoff with jitter on reconnect. Effective delay =
   min(max_backoff_seconds, reconnect_delay_seconds * 2^attempts)
   + random(0, reconnect_delay_seconds). The attempts counter resets
   after any session that runs healthy for 60+ seconds. Jitter
   desynchronises clients so a server restart doesn't trigger a
   thundering herd. New max_backoff_seconds config field, default 120.
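
   For illustration only (not part of the commit): the pre-jitter
   schedule the formula above yields with the defaults (base 3s,
   cap 120s).

   ```go
   package main

   import "fmt"

   func main() {
       // Defaults: reconnect_delay_seconds=3, max_backoff_seconds=120.
       base, limit := 3, 120
       delay := base
       for attempt := 1; attempt <= 8; attempt++ {
           fmt.Printf("attempt %d: %ds + jitter in [0,%ds)\n", attempt, delay, base)
           delay *= 2
           if delay > limit {
               delay = limit
           }
       }
       // Prints 3, 6, 12, 24, 48, 96, 120, 120.
   }
   ```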

3. Unified error signalling: the sender goroutine now tears down the
   stream's context when it hits a Send error. Previously only Recv
   errors triggered a reconnect — a stale stream where only Send was
   broken could sit there indefinitely.
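
   Condensed from the Run() changes in main.go below (variable names
   as in the diff):

   ```go
   streamCtx, streamCancel := context.WithCancel(ctx)
   defer streamCancel()

   go func() {
       for resp := range responseChan {
           if err := stream.Send(resp); err != nil {
               streamCancel() // unblocks stream.Recv in the main loop
               return
           }
       }
   }()
   ```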

Also: chain_id is now a required config field and is sent in the
x-chain-id gRPC metadata header alongside x-api-key and
x-smart-contract-id. Prime rejects streams that omit it with
"missing chain ID", a failure that silently broke every
template-based client until users discovered it the hard way. The
README documents the durability contract so contract authors know
they don't have to reimplement any of it.
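
The metadata block now reads (from the main.go diff below):

```go
md := metadata.Pairs(
    "x-api-key", c.config.APIKey,
    "x-smart-contract-id", c.config.SmartContractID,
    "x-chain-id", c.config.ChainID, // required; prime replies "missing chain ID" without it
)
streamCtx = metadata.NewOutgoingContext(streamCtx, md)
```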
2026-04-19 21:23:47 -04:00
parent 0634e66469
commit 2bc57c073d
3 changed files with 173 additions and 30 deletions

README.md

@@ -43,6 +43,7 @@ make tools
4. **Configure your connection** by editing `config.yaml`:
```yaml
server_address: "your-dragonchain-server:50051"
chain_id: "your-chain-public-id"
smart_contract_id: "your-smart-contract-id"
api_key: "your-api-key"
```
@@ -60,13 +61,24 @@ make tools
| Field | Description | Default |
|-------|-------------|---------|
| `server_address` | gRPC server address | Required |
| `chain_id` | Public chain id the SC is registered on (sent as `x-chain-id` metadata) | Required |
| `smart_contract_id` | Your smart contract ID | Required |
| `api_key` | API key for authentication | Required |
| `use_tls` | Enable TLS encryption | `false` |
| `tls_cert_path` | Path to TLS certificate | - |
| `num_workers` | Concurrent transaction processors | `10` |
| `reconnect_delay_seconds` | Base delay for exponential backoff between reconnect attempts | `3` |
| `max_backoff_seconds` | Ceiling for the exponential backoff | `120` |
| `max_reconnect_attempts` | Max reconnect attempts (0 = infinite, recommended) | `0` |
## Durability guarantees (provided by `main.go`, no work for you)
- **Server restart, update, crash, or network blip** → the client auto-reconnects and resumes processing. Transactions observed while the stream was down stay queued on the Dragonchain Prime side and are delivered (oldest first) on reconnect.
- **Client restart or long outage** → when this process comes back up (minutes, hours, months later), it rejoins the stream and prime re-delivers every still-pending transaction that should have invoked it.
- **Half-open TCP** (silent peer, resumed laptop, corporate NAT dropping idle flows) is detected within ~13 seconds via gRPC keepalive and triggers a reconnect. No dangling ghost streams.
- **Reconnect storms** are avoided: exponential backoff with jitter means many clients reconnecting after a server restart don't all slam `accept()` at the same instant. The timer resets after a stream has been healthy for 60 seconds.
These are invariants of the template — you do not add any of this in `process.go`.
## Implementing Your Smart Contract

config.yaml

@@ -4,6 +4,11 @@
# The gRPC server address to connect to
server_address: "localhost:50051"
# The public chain id on which this smart contract is registered.
# Sent as the x-chain-id gRPC metadata header — prime rejects streams
# without it.
chain_id: "your-chain-public-id"
# Your smart contract ID (provided by Dragonchain)
smart_contract_id: "your-smart-contract-id"
@@ -19,6 +24,12 @@ use_tls: false
# Number of worker goroutines for processing transactions concurrently
num_workers: 10
# Reconnect settings. The client uses exponential backoff with jitter:
# effective delay = min(max_backoff_seconds, reconnect_delay_seconds * 2^attempts) + random(0, reconnect_delay_seconds).
# Keep max_reconnect_attempts at 0 (infinite) unless you have a specific
# reason to stop — the client is designed to survive arbitrarily long
# outages and resume processing from the prime-side queue when the
# server returns.
reconnect_delay_seconds: 3
max_backoff_seconds: 120
max_reconnect_attempts: 0 # 0 = infinite retries

main.go

@@ -7,6 +7,7 @@ import (
"fmt" "fmt"
"io" "io"
"log" "log"
"math/rand"
"os" "os"
"os/signal" "os/signal"
"sync" "sync"
@@ -17,6 +18,7 @@ import (
"google.golang.org/grpc" "google.golang.org/grpc"
"google.golang.org/grpc/credentials" "google.golang.org/grpc/credentials"
"google.golang.org/grpc/credentials/insecure" "google.golang.org/grpc/credentials/insecure"
"google.golang.org/grpc/keepalive"
"google.golang.org/grpc/metadata" "google.golang.org/grpc/metadata"
"gopkg.in/yaml.v3" "gopkg.in/yaml.v3"
) )
@@ -25,18 +27,41 @@ import (
// Configuration and Client Infrastructure
// Do not modify this file unless you need to customize the client behavior.
// Implement your smart contract logic in process.go instead.
//
// Durability contract (provided by this file, no work for the user):
// - If the Dragonchain Prime server restarts, updates, or momentarily
// drops the network, this client auto-reconnects. Transactions
// observed during the outage are queued by prime and delivered once
// the stream is re-established.
// - If this client restarts (crash, deploy, long sleep), it rejoins
// the stream and prime re-delivers every still-pending transaction
// that should have invoked it, oldest first.
// - Half-open TCP (a silent peer that never sent FIN) is detected
// within ~13 s via gRPC keepalive pings. No dangling ghost streams.
// =============================================================================
// Config holds the client configuration loaded from YAML
type Config struct {
ServerAddress string `yaml:"server_address"`
ChainID string `yaml:"chain_id"`
SmartContractID string `yaml:"smart_contract_id"`
APIKey string `yaml:"api_key"`
UseTLS bool `yaml:"use_tls"`
TLSCertPath string `yaml:"tls_cert_path"`
NumWorkers int `yaml:"num_workers"`
// ReconnectDelaySecs is the BASE backoff between reconnect attempts.
// The effective delay is `base * 2^attempts + jitter` capped at
// MaxBackoffSeconds — so repeated failures back off, but a clean
// server restart is picked up within a few seconds.
ReconnectDelaySecs int `yaml:"reconnect_delay_seconds"`
// MaxBackoffSeconds caps the exponential backoff. Default 120.
MaxBackoffSeconds int `yaml:"max_backoff_seconds"`
// MaxReconnectAttempts: 0 = infinite (default and recommended — the
// whole point of this client is to stay available indefinitely).
MaxReconnectAttempts int `yaml:"max_reconnect_attempts"`
}
// Client manages the gRPC connection and request processing
@@ -72,6 +97,18 @@ func (c *Client) Connect() error {
opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials()))
}
// Keepalive is the load-bearing piece for detecting a half-open
// connection. Without it, a silent peer (prime restarted without
// sending FIN; laptop resumed from sleep; corporate NAT dropped the
// flow) leaves us in a "connected" state until the OS-level TCP
// keepalive eventually fires — which on Linux defaults to ~2 hours.
// 10 s ping + 3 s timeout catches all of that within ~13 s.
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{
Time: 10 * time.Second,
Timeout: 3 * time.Second,
PermitWithoutStream: true,
}))
conn, err := grpc.NewClient(c.config.ServerAddress, opts...)
if err != nil {
return fmt.Errorf("failed to connect to server: %w", err)
@@ -91,17 +128,31 @@ func (c *Client) Close() error {
return nil
}
// Run starts the client and processes incoming requests. It returns
// when the stream terminates for any reason (server close, network
// error, ctx cancellation). The outer reconnect loop in main() calls
// Run again after a backoff.
func (c *Client) Run(ctx context.Context) error {
// Wrap ctx with our own cancel so the sender goroutine can tear down
// the stream on Send errors — otherwise stream.Recv() in the main
// loop could block forever waiting for a peer that is never coming
// back. Any cancel from here propagates to both directions of the
// bidi stream.
streamCtx, streamCancel := context.WithCancel(ctx)
defer streamCancel()
// Auth + routing metadata. x-chain-id is required by the server; a
// missing header yields "missing chain ID" from prime and no
// transactions will arrive.
md := metadata.Pairs(
"x-api-key", c.config.APIKey,
"x-smart-contract-id", c.config.SmartContractID,
"x-chain-id", c.config.ChainID,
)
streamCtx = metadata.NewOutgoingContext(streamCtx, md)
// Establish the bi-directional stream
stream, err := c.grpcClient.Run(streamCtx)
if err != nil {
return fmt.Errorf("failed to establish stream: %w", err)
}
@@ -110,15 +161,17 @@ func (c *Client) Run(ctx context.Context) error {
// Channel to collect responses from workers
responseChan := make(chan *pb.SmartContractResponse, c.config.NumWorkers*2)
errChan := make(chan error, 2)
// Start worker goroutines
for i := 0; i < c.config.NumWorkers; i++ {
c.wg.Add(1)
go c.worker(streamCtx, responseChan)
}
// Sender: forwards worker responses back to the server. Any Send
// error immediately cancels streamCtx so the Recv loop below exits
// instead of blocking forever.
go func() {
for resp := range responseChan {
if err := stream.Send(resp); err != nil {
@@ -127,12 +180,17 @@ func (c *Client) Run(ctx context.Context) error {
case errChan <- err:
default:
}
streamCancel()
return
}
}
}()
// Main loop: receive requests and dispatch to workers. stream.Recv
// returns when the peer closes the stream, when streamCtx is cancelled
// (e.g. because the sender goroutine hit an error), or on a real
// transport error.
var recvErr error
for {
req, err := stream.Recv()
if err == io.EOF {
@@ -140,24 +198,40 @@ func (c *Client) Run(ctx context.Context) error {
break
}
if err != nil {
recvErr = err
break
}
c.logger.Printf("Received request: transaction_id=%s", req.TransactionId)
select {
case c.workChan <- req:
case <-streamCtx.Done():
recvErr = streamCtx.Err()
goto cleanup
}
}
cleanup:
// Tear down in-flight workers. Cancelling streamCtx was already done
// via defer; close(workChan) lets the worker goroutines exit their
// range loop cleanly.
close(c.workChan)
c.wg.Wait()
close(responseChan)
c.workChan = make(chan *pb.SmartContractRequest, c.config.NumWorkers*2)
if recvErr != nil {
return fmt.Errorf("error receiving request: %w", recvErr)
}
// Surface any earlier Send error the sender goroutine parked on
// errChan so the reconnect loop sees it.
select {
case err := <-errChan:
return fmt.Errorf("stream send error: %w", err)
default:
return nil
}
}
// worker processes requests from the work channel
@@ -222,7 +296,8 @@ func LoadConfig(path string) (*Config, error) {
config := &Config{
NumWorkers: 10,
ReconnectDelaySecs: 3,
MaxBackoffSeconds: 120,
}
if err := yaml.Unmarshal(data, config); err != nil {
@@ -233,6 +308,9 @@ func LoadConfig(path string) (*Config, error) {
if config.ServerAddress == "" {
return nil, fmt.Errorf("server_address is required")
}
if config.ChainID == "" {
return nil, fmt.Errorf("chain_id is required")
}
if config.SmartContractID == "" {
return nil, fmt.Errorf("smart_contract_id is required")
}
@@ -243,6 +321,39 @@ func LoadConfig(path string) (*Config, error) {
return config, nil
}
// nextBackoff returns the duration to sleep before the next reconnect.
// Computed as base * 2^attempts with a random jitter in [0, base) and
// capped at MaxBackoffSeconds. Jitter matters when many clients
// reconnect simultaneously after a server restart — it desynchronises
// them so they don't all slam accept() at the same instant.
func nextBackoff(cfg *Config, attempts int) time.Duration {
base := time.Duration(cfg.ReconnectDelaySecs) * time.Second
if base <= 0 {
base = 3 * time.Second
}
maxBackoff := time.Duration(cfg.MaxBackoffSeconds) * time.Second
if maxBackoff <= 0 {
maxBackoff = 120 * time.Second
}
// Cap the exponent so we don't overflow. 2^10 = 1024 ≈ always
// clipped by maxBackoff anyway, but keep the math bounded.
shift := attempts
if shift > 10 {
shift = 10
}
delay := base << shift
if delay > maxBackoff {
delay = maxBackoff
}
// Jitter range == base, independent of attempts. Adding it ensures
// we don't schedule a thundering herd on the next attempt even if
// every client started with the same `attempts` count.
jitter := time.Duration(rand.Int63n(int64(base)))
return delay + jitter
}
func main() {
configPath := flag.String("config", "config.yaml", "Path to configuration file")
flag.Parse()
@@ -268,20 +379,29 @@ func main() {
cancel()
}()
// Connection loop with reconnection logic. A "successful" session
// is defined as one where Run() was active for at least 60 s —
// anything longer is almost certainly real work, so reset attempts
// so the next failure starts the backoff schedule fresh.
attempts := 0
const healthyRunThreshold = 60 * time.Second
for {
if err := client.Connect(); err != nil {
log.Printf("Connection failed: %v", err)
} else {
start := time.Now()
if err := client.Run(ctx); err != nil {
if ctx.Err() != nil {
log.Println("Shutdown requested")
_ = client.Close()
break
}
log.Printf("Stream error: %v", err)
}
if time.Since(start) > healthyRunThreshold {
attempts = 0
}
}
_ = client.Close()
@@ -297,13 +417,13 @@ func main() {
break
}
delay := nextBackoff(config, attempts-1)
log.Printf("Reconnecting in %v (attempt %d)...", delay, attempts)
select {
case <-time.After(delay):
case <-ctx.Done():
return
}
} }