Boost.Corosio Performance Benchmarks

Executive Summary

This report presents comprehensive performance benchmarks comparing Boost.Corosio against Boost.Asio on Windows using the IOCP (I/O Completion Ports) backend. The benchmarks cover handler dispatch, socket throughput, socket latency, and HTTP server workloads.

Bottom Line

Corosio demonstrates exceptional single-threaded handler dispatch performance (2× faster than Asio) and superior interleaved post/run throughput (70% faster). However, Asio shows better multi-threaded scaling in both handler dispatch and HTTP server workloads. Socket I/O throughput is essentially identical between the two implementations.

Where Corosio Excels

  • Single-threaded handler post: 2× faster than Asio (1.59 Mops/s vs 802 Kops/s)

  • Interleaved post/run: 70% faster (2.90 Mops/s vs 1.71 Mops/s)

  • Concurrent post and run: 14% faster (1.68 Mops/s vs 1.48 Mops/s)

  • Unidirectional socket throughput: Essentially identical (within 2%), with a slight Corosio edge at 1 KB, 4 KB, and 64 KB buffers

Where Corosio Needs Improvement

  • Multi-threaded handler scaling: Throughput regresses from 4→8 threads (2.58→2.09 Mops/s)

  • Multi-threaded HTTP: Asio is 56% faster at 8 threads (337.68 vs 215.94 Kops/s)

  • Tail latency: Ping-pong p99 latency is 49% to 72% higher than Asio (about 21 μs vs 12 to 14 μs)

  • Multi-threaded HTTP latency: Mean latency at 32 connections is 49% to 57% higher than Asio at 4 and 8 threads (147.86 μs vs 94.26 μs at 8 threads)

Key Insights

Component          Assessment
Handler Dispatch   Corosio is significantly faster single-threaded, but Asio scales better with threads
Socket I/O         Essentially identical throughput; Asio has ~0.5 μs lower mean latency per operation
HTTP Server        Tied at 1 and 2 threads; Asio is 49% faster at 4 threads and 56% faster at 8
Scaling Behavior   Corosio shows thread contention issues at 8 threads

Next Steps

  1. Profile multi-threaded contention: Investigate the 4→8 thread regression

  2. Reduce per-operation latency: Target the ~0.5 μs gap in socket operations

  3. Benchmark on Linux: Validate findings on epoll backend

  4. Test realistic workloads: Mixed payload sizes and real-world traffic patterns


Detailed Results

Handler Dispatch Summary

Scenario                     Corosio       Asio          Winner
Single-threaded post         1.59 Mops/s   802 Kops/s    Corosio (+98%)
Multi-threaded (8 threads)   2.09 Mops/s   3.02 Mops/s   Asio (+44%)
Interleaved post/run         2.90 Mops/s   1.71 Mops/s   Corosio (+70%)
Concurrent post/run          1.68 Mops/s   1.48 Mops/s   Corosio (+14%)

Socket Throughput Summary

Scenario                     Corosio     Asio        Winner
Unidirectional 1KB buffer    215 MB/s    213 MB/s    Tie
Unidirectional 64KB buffer   6.43 GB/s   6.40 GB/s   Tie
Bidirectional 64KB buffer    6.15 GB/s   6.50 GB/s   Asio (+6%)

Socket Latency Summary

Scenario                         Corosio     Asio        Winner
Ping-pong mean (64B)             10.10 μs    9.61 μs     Asio (-5%)
Ping-pong p99 (64B)              21.20 μs    13.30 μs    Asio (-37%)
16 concurrent pairs, mean (64B)  162.95 μs   160.49 μs   Tie

HTTP Server Summary

Scenario                    Corosio         Asio            Winner
Single connection           96.31 Kops/s    95.96 Kops/s    Tie
32 connections, 8 threads   215.94 Kops/s   337.68 Kops/s   Asio (+56%)

Test Environment

Platform      Windows (IOCP backend)
Benchmarks    Handler dispatch, socket throughput, socket latency, HTTP server
Measurement   Client-side latency and throughput

Handler Dispatch Benchmarks

These benchmarks measure raw handler posting and execution throughput, isolating the scheduler from I/O completion overhead.

Single-Threaded Handler Post

Posting 5,000,000 handlers from a single thread.

Metric       Corosio       Asio         Difference
Handlers     5,000,000     5,000,000
Elapsed      3.143 s       6.233 s      -50%
Throughput   1.59 Mops/s   802 Kops/s   +98%

Key finding: Corosio’s single-threaded handler dispatch is nearly 2× faster than Asio.
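The elapsed-time and throughput rows describe the same measurement from two directions: throughput is the handler count divided by elapsed time, so halving the elapsed time roughly doubles throughput. A minimal sketch of that arithmetic follows; the function names are illustrative and do not come from the benchmark source.

```cpp
#include <cstdint>

// Throughput in operations per second for `count` operations that
// completed in `elapsed_seconds` of wall-clock time.
double ops_per_second(std::uint64_t count, double elapsed_seconds) {
    return static_cast<double>(count) / elapsed_seconds;
}

// Relative throughput gain of run A over run B when both executed the
// same number of operations. Because throughput is count / elapsed,
// the gain reduces to elapsed_b / elapsed_a - 1.
double throughput_gain(double elapsed_a, double elapsed_b) {
    return elapsed_b / elapsed_a - 1.0;
}
```

With the values above, `ops_per_second(5000000, 3.143)` is about 1.59 million and `throughput_gain(3.143, 6.233)` is about 0.98, which matches the +98% in the table.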

Multi-Threaded Scaling

Multiple threads running handlers concurrently (5,000,000 handlers total).

Threads   Corosio       Asio          Corosio Speedup   Asio Speedup
1         2.46 Mops/s   1.51 Mops/s   (baseline)        (baseline)
2         2.24 Mops/s   2.16 Mops/s   0.91×             1.43×
4         2.58 Mops/s   2.97 Mops/s   1.05×             1.96×
8         2.09 Mops/s   3.02 Mops/s   0.85×             1.99×

Scaling Analysis

Throughput vs Thread Count:

Threads    Corosio    Asio       Winner
   1       2.46       1.51       Corosio +63%
   2       2.24       2.16       Corosio +4%
   4       2.58       2.97       Asio +15%
   8       2.09       3.02       Asio +44%
                ↑
           (regression)

Notable observations:

  • Corosio is faster at 1-2 threads

  • Crossover occurs between 2-4 threads

  • Corosio regresses from 4→8 threads (2.58 → 2.09 Mops/s)

  • Asio scales through 4 threads, then plateaus without regressing (2.97 → 3.02 Mops/s from 4→8 threads)

Interleaved Post/Run

Alternating between posting batches and running them (50,000 iterations × 100 handlers).

Metric           Corosio       Asio          Difference
Total handlers   5,000,000     5,000,000
Elapsed          1.723 s       2.930 s       -41%
Throughput       2.90 Mops/s   1.71 Mops/s   +70%

Key finding: Corosio excels at interleaved post/run, a common pattern in real applications.

Concurrent Post and Run

Four threads simultaneously posting and running handlers.

Metric           Corosio       Asio          Difference
Threads          4             4
Total handlers   5,000,000     5,000,000
Elapsed          2.970 s       3.374 s       -12%
Throughput       1.68 Mops/s   1.48 Mops/s   +14%

Socket Throughput Benchmarks

Unidirectional Throughput

Single direction transfer of 4096 MB with varying buffer sizes.

Buffer Size   Corosio       Asio          Difference
1024 bytes    215.20 MB/s   213.17 MB/s   +1%
4096 bytes    757.98 MB/s   743.34 MB/s   +2%
16384 bytes   2.56 GB/s     2.58 GB/s     -1%
65536 bytes   6.43 GB/s     6.40 GB/s     +0.5%

Observation: Throughput is essentially identical (within 2% at every buffer size). Both implementations reach 6.4 GB/s with 64 KB buffers.
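One way to read the throughput table is to convert each row into the average time spent per buffer-sized transfer. The sketch below does that conversion. It assumes 1 MB = 10^6 bytes and 1 GB = 10^9 bytes; the report does not say whether its units are decimal or binary, so the absolute values could shift by several percent. The function name is illustrative.

```cpp
// Average time per buffer-sized transfer, in microseconds, implied by a
// measured throughput. `bytes_per_second` is the throughput converted to
// bytes (decimal units are an assumption, e.g. 215.20 MB/s -> 215.20e6).
double microseconds_per_buffer(double buffer_bytes, double bytes_per_second) {
    return buffer_bytes / bytes_per_second * 1e6;
}
```

Under that assumption, Corosio's 1024-byte row works out to roughly 4.8 μs per transfer and its 65536-byte row to roughly 10.2 μs. Moving 64 times more data per operation costs only about twice the time, which suggests that a fixed per-operation cost dominates at small buffer sizes for both implementations.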

Bidirectional Throughput

Simultaneous transfer of 2048 MB in each direction (4096 MB total).

Buffer Size   Corosio       Asio          Difference
1024 bytes    214.55 MB/s   212.18 MB/s   +1%
4096 bytes    707.35 MB/s   755.43 MB/s   -6%
16384 bytes   2.48 GB/s     2.59 GB/s     -4%
65536 bytes   6.15 GB/s     6.50 GB/s     -5%

Observation: Asio has a slight edge (4% to 6%) in bidirectional throughput at buffer sizes of 4 KB and above; at 1 KB the two are within 1%.

Socket Latency Benchmarks

Ping-Pong Round-Trip Latency

Single socket pair exchanging messages (1,000,000 iterations each).

Message Size   Corosio Mean   Asio Mean   Difference   Corosio p99   Asio p99
1 byte         10.04 μs       9.66 μs     +4%          21.10 μs      14.20 μs
64 bytes       10.10 μs       9.61 μs     +5%          21.20 μs      13.30 μs
1024 bytes     10.03 μs       9.66 μs     +4%          21.10 μs      12.30 μs

Latency Distribution (64-byte messages)

Percentile   Corosio     Asio       Difference
p50          9.60 μs     9.20 μs    +4%
p90          9.80 μs     9.70 μs    +1%
p99          21.20 μs    13.30 μs   +59%
p99.9        115.70 μs   76.40 μs   +51%
min          8.30 μs     8.10 μs    +2%
max          3.15 ms     2.13 ms    +48%

Observation: Mean latencies are very close (0.4 to 0.5 μs apart), but Corosio has significantly higher tail latency: +59% at p99 and +51% at p99.9 for 64-byte messages.
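For readers reproducing the distribution table from raw samples, the sketch below computes a percentile with the nearest-rank method. The report does not state which percentile definition the benchmark harness uses, so the method and the function name are assumptions; interpolating definitions give slightly different values.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
// `samples` is taken by value so the caller's data is left unsorted.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;  // no data; caller decides how to report
    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    // Multiply before dividing so whole-number percentiles stay exact.
    const double r = std::ceil(p * n / 100.0);
    std::size_t rank;
    if (r < 1.0) rank = 1;                    // clamp p <= 0 to the minimum
    else if (r > n) rank = samples.size();    // clamp p > 100 to the maximum
    else rank = static_cast<std::size_t>(r);
    return samples[rank - 1];
}
```

Because a mean averages over every sample, a small number of slow round trips barely moves it while still dominating p99 and p99.9, which is the pattern the table shows.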

Concurrent Socket Pairs

Multiple socket pairs operating concurrently (64-byte messages).

Pairs   Iterations   Corosio Mean   Asio Mean   Corosio p99   Asio p99
1       1,000,000    9.95 μs        9.55 μs     19.20 μs      13.10 μs
4       500,000      40.90 μs       39.54 μs    81.88 μs      69.60 μs
16      250,000      162.95 μs      160.49 μs   357.36 μs     344.09 μs

Observation: Both implementations scale similarly with concurrent pairs. Asio maintains a small latency advantage throughout.

HTTP Server Benchmarks

Single Connection (Sequential Requests)

Metric         Corosio        Asio           Difference
Requests       1,000,000      1,000,000
Elapsed        10.383 s       10.421 s       -0.4%
Throughput     96.31 Kops/s   95.96 Kops/s   +0.4%
Mean latency   10.36 μs       10.39 μs       -0.3%
p99 latency    14.70 μs       13.80 μs       +7%

Observation: Single-connection HTTP performance is essentially identical.

Concurrent Connections (Single Thread)

Connections   Corosio Throughput   Asio Throughput   Corosio Mean   Asio Mean   Gap
1             92.71 Kops/s         92.35 Kops/s      10.76 μs       10.80 μs    Tie
4             92.64 Kops/s         91.14 Kops/s      43.15 μs       43.86 μs    Tie
16            92.03 Kops/s         90.38 Kops/s      173.83 μs      177.00 μs   Tie
32            92.14 Kops/s         89.11 Kops/s      347.27 μs      359.06 μs   Corosio +3%

Observation: Single-threaded HTTP throughput stays flat at roughly 89 to 93 Kops/s for both implementations as connections increase; Corosio holds a small edge (+3%) at 32 connections.
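The table shows throughput staying flat while mean latency grows almost in proportion to the connection count. That is what Little's law predicts for a closed loop in which each connection keeps exactly one request in flight, which is an assumption about the benchmark client rather than something the report states. A minimal sketch of the check, with an illustrative function name:

```cpp
// Little's law for a closed loop: with `connections` requests outstanding
// and a mean latency of `mean_latency_us` microseconds per request, the
// sustained throughput is connections / latency (in requests per second).
double expected_ops_per_second(int connections, double mean_latency_us) {
    return static_cast<double>(connections) / (mean_latency_us * 1e-6);
}
```

For Corosio at 32 connections, `expected_ops_per_second(32, 347.27)` gives about 92.1 thousand requests per second, in line with the measured 92.14 Kops/s. Under this model the rising mean latency reflects queueing behind a single saturated thread, not slower processing of each request.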

Multi-Threaded HTTP (32 Connections)

Threads   Corosio Throughput   Asio Throughput   Gap    Scaling Factor (Corosio / Asio)
1         89.72 Kops/s         88.25 Kops/s      +2%    (baseline)
2         127.27 Kops/s        127.48 Kops/s     0%     1.42× / 1.44×
4         141.15 Kops/s        210.64 Kops/s     -33%   1.57× / 2.39×
8         215.94 Kops/s        337.68 Kops/s     -36%   2.41× / 3.83×

Multi-Threaded Latency

Threads   Corosio Mean   Asio Mean   Corosio p99   Asio p99
1         356.63 μs      362.58 μs   748.50 μs     620.88 μs
2         251.37 μs      250.92 μs   384.09 μs     352.85 μs
4         226.46 μs      151.75 μs   447.79 μs     192.31 μs
8         147.86 μs      94.26 μs    188.26 μs     120.68 μs

Key finding: Asio scales significantly better in multi-threaded HTTP workloads, achieving 3.83× scaling from 1→8 threads compared to Corosio’s 2.41×.

Analysis

Performance Characteristics

Handler Dispatch

Corosio shows dramatically better single-threaded performance but struggles with multi-threaded scaling:

Scenario               Corosio Advantage   Notes
Single-threaded        +98%                Nearly 2× faster
Interleaved post/run   +70%                Excellent batch handling
Concurrent 4 threads   +14%                Still competitive
8 threads              -31%                Scaling regression (Asio is 44% faster)

Socket I/O

Socket throughput is essentially identical between implementations. Latency shows:

  • Mean latency: Corosio ~0.5 μs slower

  • Tail latency: Corosio is 49% to 72% higher at p99, depending on message size (about 21 μs vs 12.3 to 14.2 μs)

HTTP Server

The HTTP benchmarks reveal a scaling disparity:

Multi-threaded HTTP Throughput:

Threads    Corosio      Asio        Winner
   1       89.7 K       88.3 K      Tie
   2       127.3 K      127.5 K     Tie
   4       141.2 K      210.6 K     Asio +49%
   8       215.9 K      337.7 K     Asio +56%

Scaling Behavior

The benchmarks reveal a consistent pattern:

Behavior                     Evidence
Single-threaded excellence   2× faster handler dispatch, competitive HTTP
Multi-thread contention      Regression at 8 threads in handler dispatch
HTTP scaling gap             Asio achieves 3.83× scaling vs Corosio’s 2.41×

Conclusions

Strengths

Corosio:

  • Exceptional single-threaded handler dispatch (2× faster)

  • Superior interleaved post/run performance (70% faster)

  • Competitive socket I/O throughput

  • Identical single-connection HTTP performance

Asio:

  • Better multi-threaded scaling (no regression at 8 threads)

  • Superior multi-threaded HTTP throughput (+56% at 8 threads)

  • Lower tail latency in socket operations

  • More predictable performance under load

Recommendations

Workload                             Recommendation
Single-threaded handler processing   Corosio is 2× faster
Interleaved post/run patterns        Corosio is 70% faster
Multi-threaded HTTP servers          Asio scales better (56% faster at 8 threads)
Bulk socket transfers                Either; throughput is within 6% at every buffer size

Future Work

  • Profile the multi-threaded contention causing 8-thread regression

  • Investigate HTTP scaling disparity

  • Benchmark on Linux (epoll backend)

  • Test with realistic HTTP payloads and traffic patterns

Appendix: Raw Data

Corosio Results

Backend: iocp

=== Single-threaded Handler Post ===
  Handlers:    5000000
  Elapsed:     3.143 s
  Throughput:  1.59 Mops/s

=== Multi-threaded Scaling ===
  Handlers per test: 5000000

  1 thread(s): 2.46 Mops/s
  2 thread(s): 2.24 Mops/s (speedup: 0.91x)
  4 thread(s): 2.58 Mops/s (speedup: 1.05x)
  8 thread(s): 2.09 Mops/s (speedup: 0.85x)

=== Interleaved Post/Run ===
  Iterations:        50000
  Handlers/iter:     100
  Total handlers:    5000000
  Elapsed:           1.723 s
  Throughput:        2.90 Mops/s

=== Concurrent Post and Run ===
  Threads:           4
  Handlers/thread:   1250000
  Total handlers:    5000000
  Elapsed:           2.970 s
  Throughput:        1.68 Mops/s

=== Unidirectional Throughput ===
  Buffer size: 1024 bytes, Transfer: 4096 MB
    Throughput: 215.20 MB/s

  Buffer size: 4096 bytes, Transfer: 4096 MB
    Throughput: 757.98 MB/s

  Buffer size: 16384 bytes, Transfer: 4096 MB
    Throughput: 2.56 GB/s

  Buffer size: 65536 bytes, Transfer: 4096 MB
    Throughput: 6.43 GB/s

=== Bidirectional Throughput ===
  Buffer size: 1024 bytes: 214.55 MB/s (combined)
  Buffer size: 4096 bytes: 707.35 MB/s (combined)
  Buffer size: 16384 bytes: 2.48 GB/s (combined)
  Buffer size: 65536 bytes: 6.15 GB/s (combined)

=== Ping-Pong Round-Trip Latency ===
  1 byte:    mean=10.04 us, p99=21.10 us
  64 bytes:  mean=10.10 us, p99=21.20 us
  1024 bytes: mean=10.03 us, p99=21.10 us

=== Concurrent Socket Pairs Latency ===
  1 pair:   mean=9.95 us, p99=19.20 us
  4 pairs:  mean=40.90 us, p99=81.88 us
  16 pairs: mean=162.95 us, p99=357.36 us

=== HTTP Single Connection ===
  Throughput: 96.31 Kops/s
  Latency: mean=10.36 us, p99=14.70 us

=== HTTP Multi-threaded (32 connections) ===
  1 thread:  89.72 Kops/s, mean=356.63 us
  2 threads: 127.27 Kops/s, mean=251.37 us
  4 threads: 141.15 Kops/s, mean=226.46 us
  8 threads: 215.94 Kops/s, mean=147.86 us

Asio Results

=== Single-threaded Handler Post ===
  Handlers:    5000000
  Elapsed:     6.233 s
  Throughput:  802.18 Kops/s

=== Multi-threaded Scaling ===
  Handlers per test: 5000000

  1 thread(s): 1.51 Mops/s
  2 thread(s): 2.16 Mops/s (speedup: 1.43x)
  4 thread(s): 2.97 Mops/s (speedup: 1.96x)
  8 thread(s): 3.02 Mops/s (speedup: 1.99x)

=== Interleaved Post/Run ===
  Iterations:        50000
  Handlers/iter:     100
  Total handlers:    5000000
  Elapsed:           2.930 s
  Throughput:        1.71 Mops/s

=== Concurrent Post and Run ===
  Threads:           4
  Handlers/thread:   1250000
  Total handlers:    5000000
  Elapsed:           3.374 s
  Throughput:        1.48 Mops/s

=== Unidirectional Throughput ===
  Buffer size: 1024 bytes: 213.17 MB/s
  Buffer size: 4096 bytes: 743.34 MB/s
  Buffer size: 16384 bytes: 2.58 GB/s
  Buffer size: 65536 bytes: 6.40 GB/s

=== Bidirectional Throughput ===
  Buffer size: 1024 bytes: 212.18 MB/s (combined)
  Buffer size: 4096 bytes: 755.43 MB/s (combined)
  Buffer size: 16384 bytes: 2.59 GB/s (combined)
  Buffer size: 65536 bytes: 6.50 GB/s (combined)

=== Ping-Pong Round-Trip Latency ===
  1 byte:    mean=9.66 us, p99=14.20 us
  64 bytes:  mean=9.61 us, p99=13.30 us
  1024 bytes: mean=9.66 us, p99=12.30 us

=== Concurrent Socket Pairs Latency ===
  1 pair:   mean=9.55 us, p99=13.10 us
  4 pairs:  mean=39.54 us, p99=69.60 us
  16 pairs: mean=160.49 us, p99=344.09 us

=== HTTP Single Connection ===
  Throughput: 95.96 Kops/s
  Latency: mean=10.39 us, p99=13.80 us

=== HTTP Multi-threaded (32 connections) ===
  1 thread:  88.25 Kops/s, mean=362.58 us
  2 threads: 127.48 Kops/s, mean=250.92 us
  4 threads: 210.64 Kops/s, mean=151.75 us
  8 threads: 337.68 Kops/s, mean=94.26 us