Monitoring and Logging Design¶
The successful deployment and operation of Corda (and associated CorDapps) in a production environment requires a supporting monitoring and management capability to ensure that both a Corda node (and its supporting middleware infrastructure) and deployed CorDapps execute in a functionally correct and consistent manner. A pro-active monitoring solution will enable the immediate alerting of unexpected behaviours and associated management tooling should enable swift corrective action.
This design defines the monitoring metrics and logging outputs, and associated implementation approach, required to enable a proactive enterprise management and monitoring solution of Corda nodes and their associated CorDapps. This also includes a set of “liveliness” checks to verify and validate correct functioning of a Corda node (and associated CorDapp).
In the above diagram, the left hand side dotted box represents the components within scope for this design. It is anticipated that 3rd party enterprise-wide system management solutions will closely follow the architectural component breakdown in the right hand side box, and thus seamlessly integrate with the proposed Corda event generation and logging design. The interface between the two is de-coupled and based on textual log file parsing and adoption of industry standard JMX MBean events.
Corda currently exposes several forms of monitorable content:
Application log files using the SLF4J (Simple Logging Facade for Java) which provides an abstraction over various concrete logging frameworks (several of which are used within other Corda dependent 3rd party libraries). Corda itself uses the Apache Log4j 2 framework for logging output to a set of configured loggers (to include a rolling file appender and the console). Currently the same set of rolling log files are used by both the node and CorDapp(s) deployed to the node. The log file policy specifies a 60 day rolling period (but preserving the most recent 10Gb) with a maximum of 10 log files per day.
Industry standard exposed JMX-based metrics, both standard JVM and custom application metrics are exposed directly using the Dropwizard.io JmxReporter facility. In addition Corda also uses the Jolokia framework to make these accessible over an HTTP endpoint. Typically, these metrics are also collated by 3rd party tools to provide pro-active monitoring, visualisation and re-active management.
A full list of currently exposed metrics can be found in the appendix A.
The Corda flow framework also has placeholder support for recording additional Audit data in application flows using a simple AuditService. Audit event types are currently loosely defined and data is stored in string form (as a description and contextual map of name-value pairs) together with a timestamp and principal name. This service does not currently have an implementation of the audit event data to a persistent store.
ProgressTracker component is used to report the progress of a flow throughout its business lifecycle, and is
typically configured to report the start of a specific business workflow step (often before and after message send and
receipt where other participants form part of a multi-staged business workflow). The progress tracking framework was
designed to become a vital part of how exceptions, errors, and other faults are surfaced to human operators for
investigation and resolution. It provides a means of exporting progress as a hierarchy of steps in a way that’s both
human readable and machine readable.
In addition, in-house Corda networks at R3 use the following tools:
- Standard DataDog probes are currently used to provide e-mail based alerting for running Corda nodes. Telegraf is used in conjunction with a Jolokia agent as a collector to parse emitted metric data and push these to DataDog.
- Investigation is underway to evaluate ELK as a mechanism for parsing, indexing, storing, searching, and visualising log file data.
- Add new metrics at the level of a Corda node, individual CorDapps, and other supporting Corda components (float, bridge manager, doorman)
- Support liveness checking of the node, deployed flows and services
- Review logging groups and severities in the node.
- Separate application logging from node logging.
- Implement the audit framework that is currently only a stubbed out API
- Ensure that Corda can be used with third party systems for monitoring, log collection and audit
Out of scope¶
- Recommendation of a specific set of monitoring tools.
- Monitoring of network infrastructure like the network map service.
- Monitoring of liveness of peers.
Expanding on the first goal identified above, the following requirements have been identified:
- Message queues: latency, number of queues/messages, backlog, bridging establishment and connectivity (success / failure)
- Database: connections (retries, errors), latency, query time
- RPC metrics, latency, authentication/authorisation checking (eg. number of successful / failed attempts).
- Signing performance (eg. signatures per sec).
- Deployed CorDapps
- Garbage collector and JVM statistics
- Number of flows broken down by type (including flow status and aging statistics: oldest, latest)
- Flow durations
- JDBC connections, latency/histograms
- RPC logging
- Shell logging (user/command pairs)
- Message queue
- Exception logging (including full stack traces)
- Crash dumps (full stack traces)
- Hardware Security Module (HSM) events.
- per CorDapp logging
- Security: login authentication and authorisation
- Business Event flow progress tracking
- System events (particularly failures)
Audit data should be stored in a secure, storage medium. Audit data should include sufficient contextual information to enable optimal off-line analysis. Auditing should apply to all Corda node processes (running CorDapps, notaries, oracles).
It is envisaged that operational management and support teams will use the metrics and information collated from this design, either directly or through an integrated enterprise-wide systems management platform, to perform the following:
- Validate liveness and correctness of Corda nodes and deployed CorDapps, and the physical machine or VM they are hosted on.
- Use logging to troubleshoot operational failures (in conjunction with other supporting failure information: eg. GC logs, stack traces)
- Use reported metrics to fine-tune and tweak operational systems parameters (including dynamic setting of logging modules and severity levels to enable detailed logging).
The following design decisions are to be confirmed:
- JMX for metric eventing and SLF4J for logging Both above are widely adopted mechanisms that enable pluggability and seamless interoperability with other 3rd party enterprise-wide system management solutions.
- Continue or discontinue usage of Jolokia? (TBC - most likely yes, subject to read-only security lock-down)
- Separation of Corda Node and CorDapp log outputs (TBC)
There are a number of activities and parts to the solution proposal:
Extend JMX metric reporting through the Corda Monitoring Service and associated jolokia conversion to REST/JSON) coverage (see implementation details) to include all Corda services (vault, key management, transaction storage, network map, attachment storage, identity, cordapp provision) & sub-sytems components (state machine)
Review and extend Corda log4j2 coverage (see implementation details) to ensure
- consistent use of severities according to situation
- consistent coverage across all modules and libraries
- consistent output format with all relevant contextual information (node identity, user/execution identity, flow session identity, version information)
- separation of Corda Node and CorDapp log outputs (TBC) For consistent interleaving reasons, it may be desirable to continue using combined log output.
Publication of a code style guide to define when to use different severity levels.
Implement a CorDapp to perform sanity checking of flow framework, fundamental corda services (vault, identity), and dependent middleware infrastructure (message broker, database).
Revisit and enhance as necessary the Audit service API, and provide a persistent backed implementation, to include:
- specification of Business Event Categories (eg. User authentication and authorisation, Flow-based triggering, Corda Service invocations, Oracle invocations, Flow-based send/receive calls, RPC invocations)
- auto-enabled with Progress Tracker as Business Event generator
- RDBMS backed persistent store (independent of Corda database), with adequate security controls (authenticated access and read-only permissioning). Captured information should be consistent with standard logging, and it may be desirable to define auditable loggers within log4j2 to automatically redirect certain types of log events to the audit service.
Ensure 3rd party middleware drivers (JDBC for database, MQ for messaging) and the JVM are correctly configured to export JMX metrics. Ensure the JVM Hotspot VM command-line parameters are tuned correctly to enable detailed troubleshooting upon failure. Many of these metrics are already automatically exposed to 3rd party profiling tools such as Yourkit.
Apache Artemis has a comprehensive management API that allows a user to modify a server configuration, create new resources (e.g. addresses and queues), inspect these resources (e.g. how many messages are currently held in a queue) and interact with it (e.g. to remove messages from a queue), and exposes key metrics using JMX (using role-based authentication using Artemis’s JAAS plug-in support to ensure Artemis cannot be controlled via JMX)..
As of Corda M11, Java serialisation in the Corda node has been restricted, meaning MBeans access via the JMX port will no longer work.
Usage of Jolokia requires bundling an associated jolokia-agent-war file on the classpath, and associated configuration to export JMX monitoring statistics and data over the Jolokia REST/JSON interface. An associated jolokia-access.xml configuration file defines role based permissioning to HTTP operations.
A number of 3rd party libraries and frameworks have been proposed which solve different parts of the end to end solution, albeit with most focusing on the Agent Collector (eg. collect metrics from systems then output them to some backend storage.), Event Storage and Search, and Visualization aspects of Systems Management and Monitoring. These include:
|Splunk||£||General purpose enterprise-wide system management solution which performs collection and indexing of data, searching, correlation and analysis, visualization and reporting, monitoring and alerting.|
|ELK||OS||The ELK stack is a collection of 3 open source products from Elastic which provide an end to end enterprise-wide system management solution:
Elasticsearch: NoSQL database based on Lucene search engine
Logstash: is a log pipeline tool that accepts inputs from various sources, executes different transformations, and exports the data to various targets. Kibana: is a visualization layer that works on top of Elasticsearch.
|ArcSight||£||Enterprise Security Manager|
|Collectd||OS||Collector agent (written in C circa 2005). Data acquisition and storage handled by over 90 plugins.|
|Telegraf||OS||Collector agent (written in Go, active community)|
|Graphite||OS||Monitoring tool that stores, retrieves, shares, and visualizes time-series data.|
|StatsD||OS||Collector daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP or TCP and sends aggregates to one or more pluggable backend services (e.g., Graphite).|
|fluentd||OS||Collector daemon which collects data directly from logs and databases. Often used to analyze event logs, application logs, and clickstreams (a series of mouse clicks).|
|Prometheus||OS||End to end monitoring solution using time-series data (eg. metric name and a set of key-value pairs) and includes collection, storage, query and visualization.|
|NewRelic||£||Full stack instrumentation for application monitoring and real-time analytics solution.|
Most of the above solutions are not within the scope of this design proposal, but should be capable of ingesting the outputs (logging and metrics) defined by this design.
In general, the requirements outlined in this design are cross-cutting concerns which affect the Corda codebase holistically, both for logging and capture/export of JMX metrics.
Public APIs impacted
- No Public API’s are impacted.
Internal APIs impacted
- No identified internal API’s are impacted.
No change anticipated to following service:
This service defines and used the Codahale
MetricsRegistry, which is used by all other Corda services.
Changes expected to:
This service has been specified but not implemented. The following event types have been defined (and may need reviewing):
FlowAppAuditEvent: used in
FlowStateMachine, exposed on
FlowLogic(but never called)
FlowPermissionAuditEvent: (as above)
- All modules packaged and shipped as part of a Corda distribution (as published to Artifactory / Maven): core, node, node-api, node-driver, finance, confidential-identities, test-common, test-utils, webserver, jackson, jfx, mock, rpc
The Health checker is a CorDapp which verifies the health and liveliness of the Corda node it is deployed and running within by performing the following activities:
Corda network and middleware infrastructure connectivity checking:
- Database connectivity
- Message broker connectivity
Network Map participants summary (count, list)
- Notary summary (type, [number of cluster members]
Flow framework verification
Implement a simple flow that performs a simple “in-node” (no external messaging to 3rd party processes) round trip, and by doing so, exercises:
- flow checkpointing (including persistence to relational data store)
- message subsystem verification (creation of a send-to-self queue for purpose of routing)
- custom CordaService invocation (verify and validate behaviour of an installed CordaService)
- vault querying (verify and validate behaviour of vault query mechanism)
[this CorDapp could perform a simple Issuance of a fictional Corda token, Spend Corda token to self, Corda token exit, plus a couple of Vault queries in between: one using the VaultQuery API and the other using a Custom Query via a registered @CordaService]
RPC triggering Autotriggering of above flow using RPC to exercise the following:
- messaging subsystem verification (RPC queuing)
- authenticaton and permissions checking (against underlying configuration)
The Health checker may be deployed as part of a Corda distribution and automatically invoked upon start-up and/or manually triggered via JMX or the nodes associated Crash shell (using the startFlow command)
Please note that the Health checker application is not responsible for determining the healthiness of a Corda Network. This is the responsibility of the network operator, and may include verification checks such as:
- correct functioning of Network Map Service (registration, discovery)
- correct functioning of configured Notary
- remote messaging sub-sytem (including bridge creation)
Metrics augmentation within Corda Subsystems and Components¶
Codahale provides the following types of reportable metrics:
- Gauge: is an instantaneous measurement of a value.
- Counter: is a gauge for a numeric value (specifically of type
AtomicLong) which can be incremented or decremented.
- Meter: measures mean throughput (eg. the rate of events over time, e.g., “requests per second”). Also measures one-, five-, and fifteen-minute exponentially-weighted moving average throughputs.
- Histogram: measures the statistical distribution of values in a stream of data (minimum, maximum, mean, median, 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles).
- Timer: measures both the rate that a particular piece of code is called and the distribution of its duration (eg. rate of requests in requests per second).
- Health checks: provides a means of centralizing service (database, message broker health checks).
See Appendix B for summary of current JMX Metrics exported by the Corda codebase.
The following table identifies additional metrics to report for a Corda node:
|Component / Subsystem||Proposed Metric(s)|
|Database||Connectivity (health check)|
|Corda Persistence||Database configuration details:
Data source properties: JDBC driver, JDBC driver class name, URL
Database properties: isolation level, schema name, init database flag
Run-time metrics: total & in flight connection, session, transaction counts; committed / rolled back transaction (counter); transaction durations (metric)
|Message Broker||Connectivity (health check)|
|Corda Messaging Client|
|State Machine||Fiber thread pool queue size (counter), Live fibers (counter) , Fibers waiting for ledger commit (counter)
Flow Session Messages (counters): init, confirm, received, reject, normal end, error end, total received messages (for a given flow session, Id and state)
(in addition to existing metrics captured)
Flow error (count)
|Flow State Machine||Initiated flows (counter)
For a given flow session (counters): initiated flows, send, sendAndReceive, receive, receiveAll, retries upon send
For flow messaging (timers) to determine round trip latencies between send/receive interactions with counterparties.
Flow suspension metrics (count, age, wait reason, cordapp)
|RPC||For each RPC operation we should export metrics to report: calling user, round trip latency (timer), calling frequency (meter). Metric reporting should include the Corda RPC protocol version (should be the same as the node’s Platform Version) in play.
Failed requests would be of particular interest for alerting.
|Vault||round trip latency of Vault Queries (timer)
Soft locking counters for reserve, release (counter), elapsed times soft locks are held for per flow id (timer, histogram), list of soft locked flow ids and associated stateRefs.
attempt to soft lock fungible states for spending (timer)
|worker pool size (counter), verify duration (timer), verify throughput (meter), success (counter), failure counter), in flight (counter)|
|Notarisation||Notary details (type, members in cluster)
Counters for success, failures, failure types (conflict, invalid time window, invalid transaction, wrong notary), elapsed time (timer)
Ideally provide breakdown of latency across notarisation steps: state ref notary validation, signature checking, from sending to remote notary to receiving response
|RAFT Notary Service
(awaiting choice of new RAFT implementation)
|should include similar metrics to previous RAFT (see appendix).|
|SimpleNotaryService||success/failure uniqueness checking
success/failure time-window checking
|ValidatingNotaryService||as above plus success/failure of transaction validation|
|list of corDapps loaded in node, path used to load corDapp jars
Details per CorDapp: name, contract class names, initiated flows, rpc flows, service flows, schedulable flows, services, serialization whitelists, custom schemas, jar path
|KeyManagementService||signing requests (count), fresh key requests (count), fresh key and cert requests (count), number of loaded keys (count)|
|ContractUpgradeServiceImpl||number of authorisation upgrade requests (counter)|
|DBTransactionStorage||number of transactions in storage map (cache)
cache size (max. 1024), concurrency level (def. 8)
|Network Map||TBC (following re-engineering)|
|Identity Service||number or parties, keys, principals (in cache)
Identity verification count & latency (count, metric)
|Attachment Service||counters for open, import, checking requests
(in addition to exiting attachment count)
|Schema Service||list of registered schemas; schemaOptions per schema; table prefix.|
Logging augmentation within Corda Subsystems and Components¶
Need to ensure that Log4J2 log messages within Corda code are correctly categorized according to defined severities (from most specific to least):
- ERROR: an error in the application, possibly recoverable.
- WARNING: an event that might possible lead to an error.
- INFO: an event for informational purposes.
- DEBUG: a general debugging event.
- TRACE: a fine-grained debug message, typically capturing the flow through the application.
A logging style guide will be published to answer questions such as what severity level should be used and why when:
- A connection to a remote peer is unexpectedly terminated.
- A database connection timed out but was successfully re-established.
- A message was sent to a peer.
It is also important that we capture the correct amount of contextual information to enable rapid identification and resolution of issues using log file output. Specifically, within Corda we should include the following information in logged messages:
- Node identifier
- User name
- Flow id (runId, also referred to as
StateMachineRunId), if logging within a flow
- Other contextual Flow information (eg. counterparty), if logging within a flow
FlowStackSnapshotinformation for catastrophic flow failures. Note: this information is not currently supposed to be used in production (???).
- Session id information for RPC calls
- CorDapp name, if logging from within a CorDapp
See Appendix C for summary of current Logging and Progress Tracker Reporting coverage within the Corda codebase.
Custom logging for enhanced visibility and troubleshooting:¶
- Database SQL logging is controlled via explicit configuration of the Hibernate log4j2 logger as follows:
<Logger name="org.hibernate.SQL" level="debug" additivity="false"> <AppenderRef ref="Console-Appender"/> </Logger>
- Message broker (Apache Artemis) advanced logging is enabled by configuring log4j2 for each of the 6 available loggers defined. In general, Artemis logging is highly chatty so default logging is actually toned down for one of the defined loggers:
<Logger name="org.apache.activemq.artemis.core.server" level="error" additivity="false"> <AppenderRef ref="RollingFile-Appender"/> </Logger>
- Corda coin selection advanced logging - including display of prepared statement parameters (which are not displayed for certain database providers when enabling Hibernate debug logging):
<Logger name="net.corda.finance.contracts.asset.cash.selection" level="trace" additivity="false"> <AppenderRef ref="Console-Appender"/> </Logger>
Audit Service persistence implementation and enablement¶
- Implementation of the existing
AuditServiceAPI to write to a (pluggable) secure destination (database, message queue, other)
- Identification of Business Events that we should audit, and instrumentation of code to ensure the AuditService is called with the correct Event Type according to Business Event.
For Corda Flows it would be a good idea to use the
ProgressTrackercomponent as a means of sending Business audit events. Refer here for a detailed description of the ProgressTracker API.
- Identification of System Events that should be automatically audited.
- Specification of a database schema and associated object relational mapping implementation.
- Setup and configuration of separate database and user account.
Software Development Tools and Programming Standards to be adopted.¶
[Michele] proposes the adoption of an event-based propagation solution (and associated event-driven framework) based on separation of concerns (performance improvements through parallelisation, latency minimisation for mainline execution thread): mainstream flow logic, business audit event triggering, JMX metric reporting. This approach would continue to use the same libraries for JMX event triggering and file logging.
3rd party libraries
Jolokia is a JMX-HTTP bridge giving access to the raw data and operations without connecting to the JMX port directly. Jolokia defines the JSON and REST formats for accessing MBeans, and provides client libraries to work with that protocol as well.
Dropwizard Metrics (formerly Codahale) provides a toolkit of ways to measure the behavior of critical components in a production environment.
VisualVM is a visual tool integrating commandline JDK tools and lightweight profiling capabilities.
Appendix A - Corda exposed JMX Metrics¶
The following metrics are exposed directly by a Corda Node at run-time:
|Attachment Service||Attachments||Counts number of attachments persisted in database.|
|Verification Service||VerificationsInFlight||Gauge of number of in flight verifications handled by the out of process verification service.|
|RAFT Uniqueness Provider||RaftCluster.ThisServerStatus||Gauge|
|RAFT Uniqueness Provider||RaftCluster.MembersCount||Count|
|RAFT Uniqueness Provider||RaftCluster.Members||Gauge, containing a list of members (by server address)|
|State Machine Manager||Flows.InFlight||Gauge (number of instances of state machine manager)|
|State Machine Manager||Flows.CheckpointingRate||Meter|
|State Machine Manager||Flows.Started||Count|
|State Machine Manager||Flows.Finished||Count|
|Flow State Machine||FlowDuration||Timer|
Additionally, JMX metrics are also generated within the Corda node-driver performance testing utilities. Specifically, the
startPublishingFixedRateInjector defines and exposes
Appendix B - Corda Logging and Reporting coverage¶
Primary node services exposed publicly via ServiceHub (SH) or internally by ServiceHubInternal (SHI):
|VaultService||SH||NodeVaultService||extensive coverage including Vault Query api calls using
|NetworkMapCache||SH||NetworkMapCacheImpl||some logging (11x info, 1x warning)|
|IdentityService||SH||PersistentIdentityService||some logging (error, debug)|
|AttachmentStorage||SH||NodeAttachmentService||minimal logging (info)|
|MessagingService||SHI||NodeMessagingClient||Good coverage (error, warning, info, trace)|
|CordaPersistence||SHI||CordaPersistence||INFO coverage within
Corda subsystem components:
|NotaryService||SimpleNotaryService||some logging (warn) via
|NotaryService||RaftValidatingNotaryService||some coverage (info, debug) within
|NotaryService||BFTNonValidatingNotaryService||Logging coverage (info, debug)|
|Doorman||DoormanServer (Enterprise only)||Some logging (info, warn, error), and use of
Corda core flows:
|Flow name||Logging||Exception handling||Progress Tracking|
|NotaryFlow||none||NotaryException (NotaryError types: TimeWindowInvalid, TransactionInvalid, WrongNotary), IllegalStateException, some via
|ReceiveTransactionFlow||none||SignatureException, AttachmentResolutionException, TransactionResolutionException, TransactionVerificationException||none|
|ResolveTransactionsFlow||none||FetchDataFlow.HashNotFound (FlowException), ExcessivelyLargeTransactionGraph (FlowException)||none|
|FetchDataFlow||some logging (info)||FetchDataFlow.HashNotFound||none|
|SignTransactionFlow||none||FlowException, possibly other (general) Exception||RECEIVING, VERIFYING, SIGNING|
Corda finance flows:
|Flow name||Logging||Exception handling||Progress Tracking|
|AbstractCashFlow||none||CashException (FlowException)||GENERATING_ID, GENERATING_TX, SIGNING_TX, FINALISING_TX|
|CashIssueFlow||none||CashException (via call to
||GENERATING_TX, SIGNING_TX, FINALISING_TX|
|CashPaymentFlow||none||CashException (caused by
||GENERATING_ID, GENERATING_TX, SIGNING_TX, FINALISING_TX|
|CashExitFlow||none||CashException (caused by
||GENERATING_TX, SIGNING_TX, FINALISING_TX|
|CashIssueAndPaymentFlow||none||any thrown by
||RECEIVING, VERIFYING, SIGNING, COLLECTING_SIGNATURES, RECORDING|
|TwoPartyTradeFlow.Seller||none||FlowException, IllegalArgumentException via
||RECEIVING, VERIFYING, SIGNING, COLLECTING_SIGNATURES, RECORDING|
Confidential identities flows:
|Flow name||Logging||Exception handling||Progress Tracking|
|IdentitySyncFlow.Receive||none||CertificateExpiredException, CertificateNotYetValidException, InvalidAlgorithmParameterException||RECEIVING_IDENTITIES, RECEIVING_CERTIFICATES|
Appendix C - Apache Artemis JMX Event types and Queuing Metrics.¶
The following table contains a list of Notification Types and associated perceived importance to a Corda node at run-time:
The following table summarised the types of metrics associated with Message Queues:
|count||total number of messages added to a queue since the server started|
|countDelta||number of messages added to the queue since the last message counter update|
|messageCount||current number of messages in the queue|
|messageCountDelta||overall number of messages added/removed from the queue since the last message counter update. Positive value indicated more messages were added, negative vice versa.|
|lastAddTimestamp||timestamp of the last time a message was added to the queue|
|updateTimestamp||timestamp of the last message counter update|