下面的Java程序平均需要0.50到0.55秒的时间来运行:

public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println(
        (double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}

如果我将2 * (I * I)替换为2 * I * I,它将花费0.60到0.65秒的时间运行。如何来吗?

我把程序的每个版本都运行了15次,在两者之间交替运行。以下是调查结果:

 2*(i*i)  │  2*i*i
──────────┼──────────
0.5183738 │ 0.6246434
0.5298337 │ 0.6049722
0.5308647 │ 0.6603363
0.5133458 │ 0.6243328
0.5003011 │ 0.6541802
0.5366181 │ 0.6312638
0.515149  │ 0.6241105
0.5237389 │ 0.627815
0.5249942 │ 0.6114252
0.5641624 │ 0.6781033
0.538412  │ 0.6393969
0.5466744 │ 0.6608845
0.531159  │ 0.6201077
0.5048032 │ 0.6511559
0.5232789 │ 0.6544526

2 * i * i的最快运行时间比2 * (i * i)的最慢运行时间长。如果它们具有相同的效率,发生这种情况的概率将小于1/2^15 * 100% = 0.00305%。


当前回答

使用Java 11并使用以下VM选项关闭循环展开的有趣观察:

-XX:LoopUnrollLimit=0

带有2 * (i * i)表达式的循环会产生更紧凑的本机代码1:

L0001: add    eax,r11d
       inc    r8d
       mov    r11d,r8d
       imul   r11d,r8d
       shl    r11d,1h
       cmp    r8d,r10d
       jl     L0001

与2 * I * I版本相比:

L0001: add    eax,r11d
       mov    r11d,r8d
       shl    r11d,1h
       add    r11d,2h
       inc    r8d
       imul   r11d,r8d
       cmp    r8d,r10d
       jl     L0001

Java版本:

java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

基准测试结果:

Benchmark          (size)  Mode  Cnt    Score     Error  Units
LoopTest.fast  1000000000  avgt    5  694,868 ±  36,470  ms/op
LoopTest.slow  1000000000  avgt    5  769,840 ± 135,006  ms/op

基准测试源代码:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class LoopTest {

    @Param("1000000000") private int size;

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(LoopTest.class.getSimpleName())
            .jvmArgs("-XX:LoopUnrollLimit=0")
            .build();
        new Runner(opt).run();
    }

    @Benchmark
    public int slow() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * i * i;
        return n;
    }

    @Benchmark
    public int fast() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * (i * i);
        return n;
    }
}

1 -虚拟机选项使用:- xx:+UnlockDiagnosticVMOptions - xx:+PrintAssembly - xx:LoopUnrollLimit=0

其他回答

添加的两种方法生成的字节代码略有不同:

  17: iconst_2
  18: iload         4
  20: iload         4
  22: imul
  23: imul
  24: iadd

对于2 * (i * i) vs:

  17: iconst_2
  18: iload         4
  20: imul
  21: iload         4
  23: imul
  24: iadd

对于2 * i * i。

当像这样使用JMH基准时:

@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {

    @Benchmark
    public int noBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    @Benchmark
    public int brackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

}

区别很明显:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op

你观察到的是正确的,而不仅仅是你的基准测试风格的异常(例如,没有热身,参见如何用Java编写正确的微基准测试?)

再次与Graal一起运行:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op

您可以看到,结果更加接近,这是有道理的,因为Graal是一个整体性能更好、更现代的编译器。

因此,这实际上只是取决于JIT编译器优化特定代码段的能力,并不一定有逻辑上的原因。

我得到了类似的结果:

2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736

如果两个循环都在同一个程序中,或者每个循环都在单独的.java文件/.class中,在单独的运行中执行,我得到了相同的结果。

最后,这里是一个javap -c -v <.java>的反编译:

     3: ldc           #3                  // String 2 * (i * i):
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: iload         4
    30: imul
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: imul
    29: iload         4
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

仅供参考,

java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Kasperd在对公认答案的评论中问道:

Java和C示例使用了完全不同的寄存器名称。这两个例子都使用AMD64 ISA?

xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2

我没有足够的声誉在评论中回答这个问题,但这些都是相同的ISA。值得指出的是,GCC版本使用32位整数逻辑,而JVM编译版本内部使用64位整数逻辑。

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

使用Java 11并使用以下VM选项关闭循环展开的有趣观察:

-XX:LoopUnrollLimit=0

带有2 * (i * i)表达式的循环会产生更紧凑的本机代码1:

L0001: add    eax,r11d
       inc    r8d
       mov    r11d,r8d
       imul   r11d,r8d
       shl    r11d,1h
       cmp    r8d,r10d
       jl     L0001

与2 * I * I版本相比:

L0001: add    eax,r11d
       mov    r11d,r8d
       shl    r11d,1h
       add    r11d,2h
       inc    r8d
       imul   r11d,r8d
       cmp    r8d,r10d
       jl     L0001

Java版本:

java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

基准测试结果:

Benchmark          (size)  Mode  Cnt    Score     Error  Units
LoopTest.fast  1000000000  avgt    5  694,868 ±  36,470  ms/op
LoopTest.slow  1000000000  avgt    5  769,840 ± 135,006  ms/op

基准测试源代码:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class LoopTest {

    @Param("1000000000") private int size;

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(LoopTest.class.getSimpleName())
            .jvmArgs("-XX:LoopUnrollLimit=0")
            .build();
        new Runner(opt).run();
    }

    @Benchmark
    public int slow() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * i * i;
        return n;
    }

    @Benchmark
    public int fast() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * (i * i);
        return n;
    }
}

1 -虚拟机选项使用:- xx:+UnlockDiagnosticVMOptions - xx:+PrintAssembly - xx:LoopUnrollLimit=0

虽然与问题的环境没有直接关系,但出于好奇,我在。net Core 2.1 x64发布模式上做了同样的测试。

这是一个有趣的结果,证实了类似的现象(相反)发生在原力的黑暗面。代码:

static void Main(string[] args)
{
    Stopwatch watch = new Stopwatch();

    Console.WriteLine("2 * (i * i)");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * (i * i);
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds} ms");
    }

    Console.WriteLine();
    Console.WriteLine("2 * i * i");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * i * i;
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");
    }
}

结果:

2 * (i * i)

结果:119860736,438 ms 结果:119860736,433 ms 结果:119860736,437 ms 结果:119860736,435毫秒 结果:119860736,436 ms 结果:119860736,435毫秒 结果:119860736,435毫秒 结果:119860736,439 ms 结果:119860736,436 ms 结果:119860736,437 ms

2 * I * I

结果:119860736,417毫秒 结果:119860736,417毫秒 结果:119860736,417毫秒 结果:119860736,418 ms 结果:119860736,418 ms 结果:119860736,417毫秒 结果:119860736,418 ms 结果:119860736,416毫秒 结果:119860736,417毫秒 结果:119860736,418 ms