Update: Added iPhone 5.
Update: Added iPhone 4s, iPad 3rd gen.
Update: Added iPhone 4, iPad 1st gen.

I follow the excellent weekly posts by Mike Ash, and entered a brief discussion in the comments about toll-free bridging, in particular the difference between calling a method via Objective-C (objc_msgSend) and its equivalent CoreFoundation C call. Mike suggested adding it to his original suite of tests, which led to the following results.

iPhone 5 (-mno-thumb)

Custom Apple A6 ARM Cortex A15, up to 1.2GHz

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 0.3 | 3.1
C++ virtual method call | 100000000 | 0.3 | 3.3
Integer division | 100000000 | 1.1 | 10.9
Objective-C message send | 100000000 | 1.4 | 13.7
Float division with int conversion | 10000000 | 0.2 | 24.7
Floating-point division | 100000000 | 2.5 | 24.8
Objective-C objectAtIndex: | 10000000 | 0.4 | 35.9
CF CFArrayGetValueAtIndex | 10000000 | 0.5 | 50.9
16 byte memcpy | 10000000 | 0.7 | 65.8
16 byte malloc/free | 10000000 | 4.8 | 482.7
NSAutoreleasePool alloc/init/release | 100000 | 0.1 | 533.4
NSObject alloc/init/release | 100000 | 0.1 | 1169.0
NSInvocation message send | 100000 | 0.1 | 1391.8
16MB malloc/free | 1000 | 0.0 | 13331.8
Zero-second delayed perform | 1000 | 0.1 | 99329.1
pthread create/join | 100 | 0.0 | 120390.0
1MB memcpy | 100 | 0.0 | 421517.1

iPhone 5 (thumb)

Custom Apple A6 ARM Cortex A15, up to 1.2GHz

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 0.3 | 3.1
C++ virtual method call | 100000000 | 0.4 | 4.0
Integer division | 100000000 | 1.1 | 10.8
Objective-C message send | 100000000 | 1.4 | 13.6
Float division with int conversion | 10000000 | 0.2 | 24.9
Floating-point division | 100000000 | 2.6 | 26.4
Objective-C objectAtIndex: | 10000000 | 0.4 | 35.6
CF CFArrayGetValueAtIndex | 10000000 | 0.5 | 51.0
16 byte memcpy | 10000000 | 0.7 | 65.8
16 byte malloc/free | 10000000 | 4.7 | 474.3
NSAutoreleasePool alloc/init/release | 100000 | 0.1 | 513.2
NSObject alloc/init/release | 100000 | 0.1 | 1183.1
NSInvocation message send | 100000 | 0.1 | 1241.5
16MB malloc/free | 1000 | 0.0 | 12979.7
Zero-second delayed perform | 1000 | 0.1 | 83574.5
pthread create/join | 100 | 0.0 | 121289.2
1MB memcpy | 100 | 0.0 | 426971.7

iPad 3 (thumb)

Apple A5x ARM Cortex A9 1000MHz

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 1.1 | 11.5
C++ virtual method call | 100000000 | 1.3 | 12.7
Floating-point division | 100000000 | 2.6 | 26.1
16 byte memcpy | 10000000 | 0.3 | 26.2
Float division with int conversion | 10000000 | 0.3 | 26.3
Integer division | 100000000 | 2.9 | 28.7
Objective-C message send | 100000000 | 3.6 | 35.5
Objective-C objectAtIndex: | 10000000 | 0.7 | 69.0
CF CFArrayGetValueAtIndex | 10000000 | 1.3 | 131.7
16 byte malloc/free | 10000000 | 4.3 | 433.9
NSAutoreleasePool alloc/init/release | 100000 | 0.1 | 600.2
NSObject alloc/init/release | 100000 | 0.1 | 1235.4
NSInvocation message send | 100000 | 0.3 | 2966.6
16MB malloc/free | 1000 | 0.0 | 11633.0
Zero-second delayed perform | 1000 | 0.1 | 121336.0
pthread create/join | 100 | 0.0 | 130293.3
1MB memcpy | 100 | 0.2 | 1662780.4

iPad 3 (-mno-thumb)

Apple A5x ARM Cortex A9 1000MHz

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 0.9 | 9.0
C++ virtual method call | 100000000 | 1.1 | 11.1
Integer division | 100000000 | 2.4 | 24.1
Floating-point division | 100000000 | 2.6 | 26.1
Float division with int conversion | 10000000 | 0.3 | 26.1
16 byte memcpy | 10000000 | 0.3 | 27.1
Objective-C message send | 100000000 | 2.7 | 27.2
Objective-C objectAtIndex: | 10000000 | 0.7 | 68.4
CF CFArrayGetValueAtIndex | 10000000 | 1.0 | 103.3
16 byte malloc/free | 10000000 | 4.3 | 432.5
NSAutoreleasePool alloc/init/release | 100000 | 0.1 | 570.4
NSObject alloc/init/release | 100000 | 0.1 | 1209.9
NSInvocation message send | 100000 | 0.2 | 1682.2
16MB malloc/free | 1000 | 0.0 | 10251.2
pthread create/join | 100 | 0.0 | 118494.2
Zero-second delayed perform | 1000 | 0.1 | 121578.2
1MB memcpy | 100 | 0.2 | 1635983.3

iPhone 4s (thumb)

Apple A5 ARM Cortex A9 ~800MHz

Name | Iterations | Total time (sec) | Time per (ns)
C++ virtual method call | 100000000 | 1.1 | 11.3
IMP-cached message send | 100000000 | 1.3 | 12.6
Integer division | 100000000 | 3.1 | 31.4
Float division with int conversion | 10000000 | 0.3 | 32.6
Floating-point division | 100000000 | 3.3 | 32.6
16 byte memcpy | 10000000 | 0.3 | 32.6
Objective-C message send | 100000000 | 3.4 | 33.8
Objective-C objectAtIndex: | 10000000 | 0.9 | 85.9
CF CFArrayGetValueAtIndex | 10000000 | 1.6 | 165.0
16 byte malloc/free | 10000000 | 5.4 | 542.2
NSAutoreleasePool alloc/init/release | 100000 | 0.1 | 753.2
NSObject alloc/init/release | 100000 | 0.2 | 1511.7
NSInvocation message send | 100000 | 0.2 | 2111.9
16MB malloc/free | 1000 | 0.0 | 19033.7
pthread create/join | 100 | 0.0 | 142817.5
Zero-second delayed perform | 1000 | 0.1 | 146302.7
1MB memcpy | 100 | 0.2 | 1787482.1

iPhone 4 (-mthumb)

Apple A4 ARM Cortex A8 ~800MHz

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 0.9 | 9.0
C++ virtual method call | 100000000 | 1.0 | 10.2
16 byte memcpy | 10000000 | 0.4 | 36.2
Integer division | 100000000 | 4.1 | 40.6
Objective-C message send | 100000000 | 4.1 | 40.8
Floating-point division | 10000000 | 0.9 | 89.4
Objective-C objectAtIndex: | 10000000 | 1.1 | 105.8
Float division with int conversion | 10000000 | 1.1 | 105.8
CF CFArrayGetValueAtIndex | 10000000 | 1.7 | 168.1
NSInvocation message send | 100000 | 0.1 | 550.8
16 byte malloc/free | 10000000 | 6.6 | 656.3
NSAutoreleasePool alloc/init/release | 100000 | 0.1 | 979.5
NSObject alloc/init/release | 100000 | 0.4 | 4277.9
16MB malloc/free | 1000 | 0.0 | 20406.7
pthread create/join | 100 | 0.0 | 139971.2
Zero-second delayed perform | 1000 | 0.2 | 243883.3
1MB memcpy | 100 | 0.1 | 1150657.9

iPhone 3GS (-mthumb)

ARM Cortex A8 ~600MHz / 1.66 ns per cycle

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 1.2 | 11.7
C++ virtual method call | 100000000 | 1.3 | 13.5
16 byte memcpy | 10000000 | 0.5 | 46.0
Objective-C message send | 100000000 | 5.4 | 53.9
Integer division | 100000000 | 6.3 | 62.9
Floating-point division | 10000000 | 1.2 | 117.4
Float division with int conversion | 10000000 | 1.4 | 138.2
Objective-C objectAtIndex: | 10000000 | 1.4 | 140.1
CF CFArrayGetValueAtIndex | 10000000 | 2.2 | 220.0
16 byte malloc/free | 10000000 | 6.4 | 642.6
NSInvocation message send | 100000 | 0.1 | 723.0
NSAutoreleasePool alloc/init/release | 100000 | 0.1 | 1305.9
NSObject alloc/init/release | 100000 | 0.6 | 5743.7
16MB malloc/free | 1000 | 0.0 | 16104.0
pthread create/join | 100 | 0.0 | 185759.2
Zero-second delayed perform | 1000 | 0.4 | 353519.4
1MB memcpy | 100 | 0.2 | 2170179.2

iPhone 3GS (no thumb)

ARM Cortex A8 ~600MHz / 1.66 ns per cycle

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 1.2 | 11.8
C++ virtual method call | 100000000 | 4.3 | 42.9
Objective-C message send | 100000000 | 5.9 | 59.2
CF CFArrayGetValueAtIndex | 10000000 | 1.0 | 97.9
Integer division | 100000000 | 9.8 | 98.4
16 byte memcpy | 10000000 | 1.1 | 109.3
Floating-point division | 10000000 | 1.2 | 118.5
Objective-C objectAtIndex: | 10000000 | 1.3 | 129.0
Float division with int conversion | 10000000 | 1.4 | 142.6
16 byte malloc/free | 10000000 | 7.5 | 748.6
NSInvocation message send | 100000 | 0.1 | 806.0
NSObject alloc/init/release | 100000 | 0.5 | 4793.1
NSAutoreleasePool alloc/init/release | 100000 | 0.5 | 4953.1
16MB malloc/free | 1000 | 0.0 | 17969.2
Zero-second delayed perform | 1000 | 0.2 | 211840.4
pthread create/join | 100 | 0.0 | 214742.5
1MB memcpy | 100 | 0.3 | 3162774.6

iPhone 3G

ARM1176 ~412MHz / 2.4ns per cycle

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 3.9 | 38.6
C++ virtual method call | 100000000 | 5.0 | 49.9
Floating-point division | 10000000 | 0.8 | 81.3
Float division with int conversion | 10000000 | 0.8 | 81.4
16 byte memcpy | 10000000 | 1.4 | 136.0
Objective-C message send | 100000000 | 14.9 | 148.6
Integer division | 100000000 | 16.2 | 162.2
CF CFArrayGetValueAtIndex | 10000000 | 2.0 | 201.7
Objective-C objectAtIndex: | 10000000 | 4.2 | 418.3
NSInvocation message send | 100000 | 0.2 | 1833.2
16 byte malloc/free | 10000000 | 27.3 | 2729.8
NSObject alloc/init/release | 100000 | 1.4 | 14179.1
NSAutoreleasePool alloc/init/release | 100000 | 1.9 | 18956.7
16MB malloc/free | 1000 | 0.0 | 47811.3
Zero-second delayed perform | 1000 | 0.8 | 803419.3
pthread create/join | 100 | 0.1 | 1085830.0
1MB memcpy | 100 | 1.0 | 9902796.7

iPad (-mthumb)

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 0.7 | 7.1
C++ virtual method call | 100000000 | 0.8 | 8.1
16 byte memcpy | 10000000 | 0.3 | 27.7
Objective-C message send | 100000000 | 3.2 | 32.3
Integer division | 100000000 | 3.4 | 33.7
CF CFArrayGetValueAtIndex | 10000000 | 0.6 | 58.8
Floating-point division | 10000000 | 0.7 | 70.5
Objective-C objectAtIndex: | 10000000 | 0.8 | 81.6
Float division with int conversion | 10000000 | 0.8 | 83.1
16 byte malloc/free | 10000000 | 3.6 | 357.8
NSInvocation message send | 100000 | 0.0 | 470.8
NSAutoreleasePool alloc/init/release | 100000 | 0.3 | 2957.0
NSObject alloc/init/release | 100000 | 0.3 | 3080.2
16MB malloc/free | 1000 | 0.0 | 14824.2
pthread create/join | 100 | 0.0 | 127386.2
Zero-second delayed perform | 1000 | 0.2 | 225271.3
1MB memcpy | 100 | 0.1 | 1064566.2

iPad (-mno-thumb)

Apple A4 ARM Cortex A8 ~1GHz / 1 ns per cycle

Name | Iterations | Total time (sec) | Time per (ns)
IMP-cached message send | 100000000 | 0.8 | 8.1
C++ virtual method call | 100000000 | 2.2 | 21.8
16 byte memcpy | 10000000 | 0.3 | 28.2
Objective-C message send | 100000000 | 3.2 | 32.5
Integer division | 100000000 | 3.4 | 33.9
CF CFArrayGetValueAtIndex | 10000000 | 0.6 | 55.8
Floating-point division | 10000000 | 0.7 | 70.9
Objective-C objectAtIndex: | 10000000 | 0.8 | 81.6
Float division with int conversion | 10000000 | 0.8 | 82.8
16 byte malloc/free | 10000000 | 3.6 | 358.3
NSInvocation message send | 100000 | 0.0 | 473.4
NSAutoreleasePool alloc/init/release | 100000 | 0.3 | 3017.6
NSObject alloc/init/release | 100000 | 0.3 | 3071.8
16MB malloc/free | 1000 | 0.0 | 14623.6
pthread create/join | 100 | 0.0 | 128674.6
Zero-second delayed perform | 1000 | 0.3 | 255627.5
1MB memcpy | 100 | 0.1 | 1063407.5

Note that I reduced the iteration counts from the original tests, so whilst the total times are significantly lower, the per-iteration times still reflect overall performance. Compared to Mike’s results, these show that the IMP-cached send is indeed faster, as expected, but only after I changed to a release build. I also compiled these with Thumb disabled unless otherwise specified. I’ve recently watched some iTunes U videos released by Apple on optimizing OpenGL ES 2.0, and a key takeaway was that code for the Cortex-A8 architecture should always be compiled with Thumb enabled. The Cortex CPU uses the newer Thumb-2 instruction set, which has native instructions for floating point. The benefit of Thumb is reduced code size and potentially better performance through better use of the I-cache.
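To make the relationship between the columns concrete, here is a minimal sketch of how a "Time per (ns)" figure falls out of a total time and an iteration count. It is not the harness used for these results: the names ns_per_iteration and time_loop are hypothetical, and the timer is POSIX clock_gettime, which is an assumption chosen for portability rather than what the original suite uses.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict C modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a total wall-clock time in seconds into nanoseconds per iteration. */
static double ns_per_iteration(double total_seconds, uint64_t iterations)
{
    assert(iterations > 0);
    return total_seconds * 1e9 / (double)iterations;
}

/* Hypothetical helper: run `body` `iterations` times and return ns per call.
 * The indirect call and the loop itself are included in the measurement. */
static double time_loop(void (*body)(void), uint64_t iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++) {
        body();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double total_seconds = (double)(end.tv_sec - start.tv_sec)
                         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return ns_per_iteration(total_seconds, iterations);
}
```

Because the per-iteration figure is a ratio, reducing the iteration count shortens the run without changing what the last column measures, provided the count stays high enough to swamp timer resolution.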

Observations

  • I’ve estimated the iPhone 4 CPU to be running at 800MHz. Looking at the increased speed over the 3GS for a number of benchmarks, the average is a 1.333x increase. Multiplying 600MHz x 1.333 yields roughly 800MHz as the clock speed.
  • The IMP-cached message send is significantly faster on the newer Cortex CPU. I have read of improvements in the branch prediction logic, which is particularly important due to the greater penalty of a misprediction in the longer A8 pipeline. The code for executing the call is blx r8, where r8 contains the target address of the function and remains unchanged for the duration of the test.
  • For the 3GS, the Objective-C message send is very close to the C++ virtual method call in the no-Thumb build. I ran this test several times, and the behaviour didn’t change. The virtual method call is an indirect load of the pc register: ldr pc, [r3]. Without being able to access the PMC registers, I can’t be sure of mispredictions; however, I know that 9 instructions are executed every iteration in the C++ test. At 1.66ns per cycle, that suggests around 15ns per iteration, but we’re at 42.9. Adding an additional 13 cycles every iteration (21.58ns) for a misprediction would get us to 37ns per iteration, which is much closer. Stepping into the objc_msgSend function finds the cached method on the first pass, totaling 28 instructions per iteration. Given there are significantly more instructions for the Objective-C call, we’re probably seeing the benefits of the dual-issue architecture.
  • Memory performance of the 3GS is significantly higher. I’ve done some other micro-benchmarks, showing the 2nd gen at around 200MB/s and the 3rd gen at around 800MB/s. With some very well placed cache preloads, I’ve actually pushed the ARM1176 to almost 300MB/s.
  • Fetching an element through the CoreFoundation API (CFArrayGetValueAtIndex) is about 2x faster than objectAtIndex: on older devices; however, the gap is less significant on the newer hardware. We’ve seen significant improvements to objc_msgSend performance on the 3GS, which undoubtedly makes up much of the gap.
  • Floating point performance for scalar operations is slightly slower on the newer device.
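
The back-of-the-envelope arithmetic in the observations above (the clock estimate and the misprediction budget) can be sketched as a few small C helpers. The function names are hypothetical, and the cycle model is a deliberate simplification: it assumes one instruction retires per cycle and that any penalty is a fixed number of cycles per iteration.

```c
#include <assert.h>

/* Estimate an unknown clock by scaling a known clock by an observed speedup. */
static double estimate_clock_mhz(double known_mhz, double speedup)
{
    return known_mhz * speedup;
}

/* Duration of one cycle in nanoseconds for a clock given in MHz. */
static double ns_per_cycle(double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return 1000.0 / clock_mhz;
}

/* Estimated time per iteration, ASSUMING one instruction per cycle plus a
 * fixed per-iteration penalty (for example a branch misprediction). */
static double estimate_iteration_ns(unsigned instructions,
                                    unsigned penalty_cycles,
                                    double clock_mhz)
{
    return (double)(instructions + penalty_cycles) * ns_per_cycle(clock_mhz);
}
```

With these, estimate_clock_mhz(600.0, 1.333) gives roughly 800, and estimate_iteration_ns(9, 13, 600.0) gives roughly 37ns, matching the figures quoted for the 3GS C++ virtual call. A dual-issue core can retire more than one instruction per cycle, so the model is best read as an upper bound on the instruction cost rather than a prediction.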

Source code for this test is available here.