与Project Euler的速度比较：C vs Python vs Erlang vs Haskell

小开

看看这个博客。在过去的一年左右的时间里，他用Haskell和Python做了一些欧拉项目的问题，他通常发现haskell要快得多。我认为在这些语言中，它与你的流利程度和编码风格有更多的关系。

当谈到Python速度时，你使用了错误的实现！尝试pypy，对于这样的事情，你会发现它要快得多。

小开

问题1：erlang、python和haskell会因为使用任意长度的整数而降低速度吗？还是只要值小于MAXINT就不会？

这不太可能。关于Erlang和Haskell我不能说太多（好吧，也许下面会介绍一下Haskell），但我可以指出Python中的许多其他瓶颈。每次程序试图使用Python中的某些值执行操作时，它都应该验证这些值是否来自正确的类型，并且需要花费一点时间。你的factorCount函数只是在不同的时间分配一个带有range (1, isquare + 1)的列表，并且在运行时，malloc风格的内存分配比像你在C中那样使用计数器在范围上迭代慢得多。值得注意的是，factorCount()被多次调用，因此分配了很多列表。此外，让我们不要忘记Python是解释的，CPython解释器并不关注优化。

编辑：哦，好吧，我注意到你使用的是Python 3，所以range()不返回列表，而是返回生成器。在这种情况下，我关于分配列表的观点是错误的：该函数只是分配range对象，尽管效率低下，但不如分配包含大量项目的列表低效。

问题2：为什么haskell这么慢？是有一个编译器标志关闭了刹车，还是我的实现？（后者很有可能，因为haskell对我来说是一本有七个印记的书。）

你在使用拥抱吗？Hugs是一个相当慢的解释器。如果你正在使用它，也许你可以在GHC上获得更好的时间——但我只是在思考减压，一个好的Haskell编译器在幕后做的那种事情非常迷人，超出了我的理解：）

问题3：你能给我一些提示，如何在不改变我确定因素的方式的情况下优化这些实现吗？以任何方式优化：更好，更快，更“原生”的语言。

我会说你在玩一个无趣的游戏。了解各种语言最好的部分就是尽可能以最不同的方式使用它们：）但是我离题了，我对这一点没有任何建议。对不起，我希望有人能在这种情况下帮助你：）

问题4：我的函数实现是否允许LCO，从而避免在调用堆栈上添加不必要的帧？
据我所知，你只需要确保递归调用是返回值之前的最后一个命令。换句话说，像下面这样的函数可以使用这样的优化：
def factorial(n, acc=1):
if n > 1:
acc = acc * n
n = n - 1
return factorial(n, acc)
else:
return acc
但是，如果您的函数如下所示，则不会进行此类优化，因为在递归调用之后有一个操作（乘法）：
def factorial2(n):
if n > 1:
f = factorial2(n-1)
return f*n
else:
return 1
我将操作分离在一些局部变量中，以明确执行哪些操作。然而，最常见的是如下所示查看这些函数，但它们与我要表达的观点是等价的：
def factorial(n, acc=1):
if n > 1:
return factorial(n-1, acc*n)
else:
return acc


def factorial2(n):
if n > 1:
return n*factorial(n-1)
else:
return 1
请注意，是否进行尾递归由编译器/解释器决定。例如，如果我记得很清楚，Python解释器就不会这样做（我在示例中使用Python只是因为它的语法流畅）。无论如何，如果你发现奇怪的东西，例如带有两个参数的阶乘函数（其中一个参数的名称如acc、accumulator等），现在你知道人们为什么这样做了：）

小开

最佳答案

在x86_64Core2 Duo（2.5GHz）机器上使用GHC 7.0.3、gcc 4.4.6、Linux 2.6.29，对Haskell使用ghc -O2 -fllvm -fforce-recomp，对C使用gcc -O3 -lm进行编译。

您的C例程运行时间为8.4秒（比您的运行速度快，可能是因为-O3）
Haskell解决方案在36秒内运行（由于-O2标志）
您的factorCount'代码没有显式输入并且默认为Integer（感谢Daniel在这里纠正了我的误诊！）。使用Int给出显式类型签名（无论如何这是标准做法），时间更改为11.1秒
在factorCount'中，您不必要地调用了fromIntegral。虽然修复不会导致任何更改（编译器很聪明，对你来说很幸运）。
您使用mod，其中rem更快且足够。这将时间更改为8.5秒。
factorCount'不断应用两个永远不会改变的额外参数（number、sqrt）。工作器/包装器转换为我们提供了：

 $ time ./so
842161320


real    0m7.954s
user    0m7.944s
sys     0m0.004s

没错，7.95秒。始终如一地比C解快半秒。没有-fllvm标志，我仍然得到8.182 seconds，所以NCG后端在这种情况下也做得很好。

结论：Haskell很棒。

结果代码

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
where square = sqrt $ fromIntegral number
isquare = floor square


factorCount' :: Int -> Int -> Int -> Int -> Int
factorCount' number sqrt candidate0 count0 = go candidate0 count0
where
go candidate count
| candidate > sqrt = count
| number `rem` candidate == 0 = go (candidate + 1) (count + 2)
| otherwise = go (candidate + 1) count


nextTriangle index triangle
| factorCount triangle > 1000 = triangle
| otherwise = nextTriangle (index + 1) (triangle + index + 1)


main = print $ nextTriangle 1 1

编辑：现在我们已经探讨过了，让我们来解决这些问题

问题1：erlang、python和haskell会因为使用任意长度的整数，或者只要值小于比MAXINT？

在Haskell中，使用Integer比Int慢，但慢多少取决于执行的计算。幸运的是（对于64位机器）Int就足够了。为了可移植性，你可能应该重写我的代码以使用Int64或Word64（C不是唯一具有long的语言）。

问题2：为什么haskell这么慢？有没有编译器标志关掉刹车还是我的实现？（后者相当可能哈斯克尔对我来说是一本有七个印章的书。）

问题3：你能给我一些如何优化这些的提示吗在不改变我确定因素的方式的情况下实现？以任何方式优化：更好，更快，更“原生”的语言。

这就是我上面的回答。答案是

0）通过-O2使用优化
1）尽可能使用快速（特别是：unbox able）类型
2）rem不是mod（一个经常被遗忘的优化）和
3）工人/包装器转换（也许是最常见的优化）。

问题4：我的功能实现是否允许LCO，因此避免在调用堆栈上添加不必要的帧？

是的，那不是问题所在。干得好，很高兴你考虑了这个。

小开

Erlang实现存在一些问题。作为以下的基线，我测量的未修改Erlang程序的执行时间为47.6秒，而C代码的执行时间为12.7秒。

如果你想运行计算密集型的Erlang代码，你应该做的第一件事是使用本机代码。使用erlc +native euler12编译将时间降低到41.3秒。然而，这比在这类代码上本机编译的预期加速比要低得多（仅15%），问题是你使用了-compile(export_all)。这对实验很有用，但所有函数都有可能从外部访问这一事实导致本机编译器非常保守。（正常的BEAM模拟器没有太大影响。）用-export([solve/0]).替换此声明会提供更好的加速比：31.5秒（比基线几乎35%）。

但是代码本身有一个问题：对于factorCount循环中的每次迭代，您执行以下测试：

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

一般来说，在相同代码的不同实现之间进行公平的比较可能很棘手，特别是如果算法是数值的，因为你需要确保它们实际上在做同样的事情。由于某处的某些类型转换而在一个实现中出现的轻微舍入错误可能导致它比另一个执行更多的迭代，即使两者最终都达到相同的结果。

为了消除这个可能的错误源（并在每次迭代中消除额外的测试），我重写了factorCount函数，如下所示，以C代码为模型：

factorCount (N) ->
Sqrt = math:sqrt (N),
ISqrt = trunc(Sqrt),
if ISqrt == Sqrt -> factorCount (N, ISqrt, 1, -1);
true          -> factorCount (N, ISqrt, 1, 0)
end.


factorCount (_N, ISqrt, Candidate, Count) when Candidate > ISqrt -> Count;
factorCount ( N, ISqrt, Candidate, Count) ->
case N rem Candidate of
0 -> factorCount (N, ISqrt, Candidate + 1, Count + 2);
_ -> factorCount (N, ISqrt, Candidate + 1, Count)
end.

这次重写，没有export_all和本机编译，给了我以下运行时间：

$ erlc +native euler12.erl
$ time erl -noshell -s euler12 solve
842161320


real    0m19.468s
user    0m19.450s
sys 0m0.010s

与C代码相比，这还不算太差：

$ time ./a.out
842161320


real    0m12.755s
user    0m12.730s
sys 0m0.020s

考虑到Erlang根本不适合编写数字代码，在这样的程序上只比C慢50%是相当不错的。

最后，关于你的问题：

<强>问题1：由于使用任意长度的整数，erlang、python和haskell的速度是否松散不是只要值小于MAXINT就可以了吗？

是的，在某种程度上。在Erlang中，没有“使用32/64位算术进行环绕”的说法，所以除非编译器可以证明整数的一些界限（通常不能），否则它必须检查所有计算，看看它们是否可以适合单个标记词，或者是否必须将它们转换为堆分配的bignum。即使在运行时的实践中没有使用bignum，也必须执行这些检查。另一方面，这意味着你知道如果你突然给它比以前更大的输入，算法永远不会因为意外的整数环绕而失败。

问题4：我的函数实现是否允许LCO，从而避免在调用堆栈上添加不必要的帧？

是的，您的Erlang代码在上次调用优化方面是正确的。

小开

查看您的Erlang实现。计时包括启动整个虚拟机、运行您的程序和停止虚拟机。我很确定设置和停止erlang虚拟机需要一些时间。

如果计时是在erlang虚拟机本身中完成的，结果会有所不同，因为在这种情况下，我们只有相关程序的实际时间。否则，我相信Erlang Vm的启动和加载过程所花费的总时间加上停止它（当你把它放在程序中时）都包含在你用来计时程序输出的方法的总时间中。考虑使用我们在虚拟机本身中计时程序时使用的erlang计时本身 timer:tc/1 or timer:tc/2 or timer:tc/3。这样，erlang的结果将排除启动和停止/杀死/停止虚拟机所需的时间。这就是我的推理，考虑一下，然后再试一次你的基准测试。

实际上，我建议我们尝试在这些语言的运行时内对程序（对于有运行时的语言）进行计时，以便获得精确的值。例如，C没有启动和关闭运行时系统的开销，Erlang、Python和Haskell也是如此（98%确定这一点-我接受更正）。所以（基于这个推理）我得出结论，这个基准测试不够精确 /fair对于运行在运行时系统之上的语言来说不够精确。让我们通过这些更改再次这样做。

编辑：此外，即使所有语言都有运行时系统，启动和停止它的开销也会不同。所以我建议我们从运行时系统中计时（对于这适用的语言）。众所周知，Erlang VM在启动时有相当大的开销！

小开

关于Python优化，除了使用PyPy（对于代码零更改的非常令人印象深刻的加速），您可以使用PyPy的翻译工具链编译符合RPython的版本，或者Cython构建扩展模块，在我的测试中，这两个模块都比C版本更快，使用Cython模块几乎快两倍。作为参考，我还包括C和PyPy基准测试结果：

C（使用gcc -O3 -lm编译）

% time ./euler12-c
842161320


./euler12-c  11.95s
user 0.00s
system 99%
cpu 11.959 total

pypy1.5

% time pypy euler12.py
842161320
pypy euler12.py
16.44s user
0.01s system
99% cpu 16.449 total

RPython（使用最新的PyPy版本，c2f583445aee）

% time ./euler12-rpython-c
842161320
./euler12-rpy-c
10.54s user 0.00s
system 99%
cpu 10.540 total

cython 0.15

% time python euler12-cython.py
842161320
python euler12-cython.py
6.27s user 0.00s
system 99%
cpu 6.274 total

RPython版本有几个关键更改。要翻译成独立程序，您需要定义您的target，在本例中是main函数。预计它将接受sys.argv作为它的唯一参数，并且需要返回一个int。您可以使用translate.py% translate.py euler12-rpython.py翻译它，它会转换为C并为您编译它。

# euler12-rpython.py


import math, sys


def factorCount(n):
square = math.sqrt(n)
isquare = int(square)
count = -1 if isquare == square else 0
for candidate in xrange(1, isquare + 1):
if not n % candidate: count += 2
return count


def main(argv):
triangle = 1
index = 1
while factorCount(triangle) < 1001:
index += 1
triangle += index
print triangle
return 0


if __name__ == '__main__':
main(sys.argv)


def target(*args):
return main, None

Cython版本被重写为扩展模块_euler12.pyx，我从普通的python文件中导入和调用它。_euler12.pyx本质上与您的版本相同，只是有一些额外的静态类型声明。setup.py具有使用python setup.py build_ext --inplace构建扩展的普通样板。

# _euler12.pyx
from libc.math cimport sqrt


cdef int factorCount(int n):
cdef int candidate, isquare, count
cdef double square
square = sqrt(n)
isquare = int(square)
count = -1 if isquare == square else 0
for candidate in range(1, isquare + 1):
if not n % candidate: count += 2
return count


cpdef main():
cdef int triangle = 1, index = 1
while factorCount(triangle) < 1001:
index += 1
triangle += index
print triangle


# euler12-cython.py
import _euler12
_euler12.main()


# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext


ext_modules = [Extension("_euler12", ["_euler12.pyx"])]


setup(
name = 'Euler12-Cython',
cmdclass = {'build_ext': build_ext},
ext_modules = ext_modules
)

老实说，我对RPython或Cython都没有什么经验，并且对结果感到惊喜。如果您使用CPython，在Cython扩展模块中编写CPU密集型代码似乎是优化程序的一种非常简单的方法。

小开

问题3：你能给我一些如何优化这些实现的提示吗在不改变我确定因素的方式的情况下？任何优化方法：更好，更快，更“原生”的语言。

C的实现不是最佳的（正如Thomas M. DuBuisson所暗示的那样），该版本使用64位整数（即长期数据类型）。稍后我会研究程序集列表，但是有一个有根据的猜测，在编译代码中存在一些内存访问，这使得使用64位整数显着变慢。正是那个或生成的代码（无论是你可以在SSE寄存器中容纳更少的64位整数还是将双精度舍入为64位整数的事实都更慢）。

这是修改后的代码（简单地将长期替换为int，并且我显式内联了factorCount，尽管我认为这对于gcc-O3没有必要）：

#include <stdio.h>
#include <math.h>


static inline int factorCount(int n)
{
double square = sqrt (n);
int isquare = (int)square;
int count = isquare == square ? -1 : 0;
int candidate;
for (candidate = 1; candidate <= isquare; candidate ++)
if (0 == n % candidate) count += 2;
return count;
}


int main ()
{
int triangle = 1;
int index = 1;
while (factorCount (triangle) < 1001)
{
index++;
triangle += index;
}
printf ("%d\n", triangle);
}

运行+计时它提供：

$ gcc -O3 -lm -o euler12 euler12.c; time ./euler12
842161320
./euler12  2.95s user 0.00s system 99% cpu 2.956 total

作为参考，Thomas在前面的回答中给出了haskell实现：

$ ghc -O2 -fllvm -fforce-recomp euler12.hs; time ./euler12                                                                                      [9:40]
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12 ...
842161320
./euler12  9.43s user 0.13s system 99% cpu 9.602 total

结论：没有从ghc中拿走任何东西，它是一个很棒的编译器，但gcc通常会生成更快的代码。

小开

使用Haskell，您真的不需要显式地考虑递归。

factorCount number = foldr factorCount' 0 [1..isquare] -
(fromEnum $ square == fromIntegral isquare)
where
square = sqrt $ fromIntegral number
isquare = floor square
factorCount' candidate
| number `rem` candidate == 0 = (2 +)
| otherwise = id


triangles :: [Int]
triangles = scanl1 (+) [1,2..]


main = print . head $ dropWhile ((< 1001) . factorCount) triangles

在上面的代码中，我用常见的列表操作替换了@Thomas答案中的显式递归。代码仍然做着完全相同的事情，而我们不用担心尾递归。在我的机器上，它运行（~7.49s）比@Thomas答案中的版本（~7.04s）慢6%，而@Raedwulf的C版本运行~3.15s。似乎GHC在过去一年中有所改进。

PS。我知道这是个老问题了，我是在google上搜索的（我忘了我在搜索什么，现在……）。只想评论一下关于LCO的问题，表达一下我对Haskell的总体感受。我想对最上面的答案发表评论，但是注释不允许代码阻塞。

小开

您的Haskell实现可以通过使用Haskell包中的一些函数来大大加快。在这种情况下，我使用了primes，它只是安装了'Cabal install primes'；）

import Data.Numbers.Primes
import Data.List


triangleNumbers = scanl1 (+) [1..]
nDivisors n = product $ map ((+1) . length) (group (primeFactors n))
answer = head $ filter ((> 500) . nDivisors) triangleNumbers


main :: IO ()
main = putStrLn $ "First triangle number to have over 500 divisors: " ++ (show answer)

时间：

您的原始程序：

PS> measure-command { bin\012_slow.exe }


TotalSeconds      : 16.3807409
TotalMilliseconds : 16380.7409

改进执行

PS> measure-command { bin\012.exe }


TotalSeconds      : 0.0383436
TotalMilliseconds : 38.3436

如您所见，这个在同一台机器上运行38毫秒，而您的机器在16秒内运行：）

编译命令：

ghc -O2 012.hs -o bin\012.exe
ghc -O2 012_slow.hs -o bin\012_slow.exe

小开

问题1：Erlang、Python和Haskell会因为使用任意长度的整数，或者只要值小于比MAXINT？

对于Erlang，第一个问题可以用否定的方式回答。最后一个问题可以通过适当地使用Erlang来回答，如：

http://bredsaal.dk/learning-erlang-using-projecteuler-net

由于它比您最初的C示例更快，我想会有许多问题，因为其他人已经详细介绍了。

这个Erlang模块在便宜的上网本上执行大约5秒……它使用erlang中的网络线程模型，因此演示了如何利用事件模型。它可以分布在许多节点上。它很快。不是我的代码。

-module(p12dist).
-author("Jannich Brendle, jannich@bredsaal.dk, http://blog.bredsaal.dk").
-compile(export_all).


server() ->
server(1).


server(Number) ->
receive {getwork, Worker_PID} -> Worker_PID ! {work,Number,Number+100},
server(Number+101);
{result,T} -> io:format("The result is: \~w.\~n", [T]);
_ -> server(Number)
end.


worker(Server_PID) ->
Server_PID ! {getwork, self()},
receive {work,Start,End} -> solve(Start,End,Server_PID)
end,
worker(Server_PID).


start() ->
Server_PID = spawn(p12dist, server, []),
spawn(p12dist, worker, [Server_PID]),
spawn(p12dist, worker, [Server_PID]),
spawn(p12dist, worker, [Server_PID]),
spawn(p12dist, worker, [Server_PID]).


solve(N,End,_) when N =:= End -> no_solution;


solve(N,End,Server_PID) ->
T=round(N*(N+1)/2),
case (divisor(T,round(math:sqrt(T))) > 500) of
true ->
Server_PID ! {result,T};
false ->
solve(N+1,End,Server_PID)
end.


divisors(N) ->
divisor(N,round(math:sqrt(N))).


divisor(_,0) -> 1;
divisor(N,I) ->
case (N rem I) =:= 0 of
true ->
2+divisor(N,I-1);
false ->
divisor(N,I-1)
end.

下面的测试是在Intel（R）Atom（TM）CPU N270@1.60GHz上进行的

~$ time erl -noshell -s p12dist start


The result is: 76576500.


^C


BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
(v)ersion (k)ill (D)b-tables (d)istribution
a


real    0m5.510s
user    0m5.836s
sys 0m0.152s

小开

我将“Jannich Brendle”版本修改为1000而不是500。并列出euler12.bin、euler12.erl、p12dist.erl.的结果。

zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s p12dist start
The result is: 842161320.


real    0m3.879s
user    0m14.553s
sys     0m0.314s
zhengs-MacBook-Pro:workspace zhengzhibin$ time erl -noshell -s euler12 solve
842161320


real    0m10.125s
user    0m10.078s
sys     0m0.046s
zhengs-MacBook-Pro:workspace zhengzhibin$ time ./euler12.bin
842161320


real    0m5.370s
user    0m5.328s
sys     0m0.004s
zhengs-MacBook-Pro:workspace zhengzhibin$

小开

#include <stdio.h>
#include <math.h>


int factorCount (long n)
{
double square = sqrt (n);
int isquare = (int) square+1;
long candidate = 2;
int count = 1;
while(candidate <= isquare && candidate <= n){
int c = 1;
while (n % candidate == 0) {
c++;
n /= candidate;
}
count *= c;
candidate++;
}
return count;
}


int main ()
{
long triangle = 1;
int index = 1;
while (factorCount (triangle) < 1001)
{
index ++;
triangle += index;
}
printf ("%ld\n", triangle);
}

gcc-lm euler. c

时间/a.out

2.79s用户0.00s系统99%cpu 2.794总

小开

只是为了好玩。以下是一个更“本机”的Haskell实现：

import Control.Applicative
import Control.Monad
import Data.Either
import Math.NumberTheory.Powers.Squares


isInt :: RealFrac c => c -> Bool
isInt = (==) <$> id <*> fromInteger . round


intSqrt :: (Integral a) => a -> Int
--intSqrt = fromIntegral . floor . sqrt . fromIntegral
intSqrt = fromIntegral . integerSquareRoot'


factorize :: Int -> [Int]
factorize 1 = []
factorize n = first : factorize (quot n first)
where first = (!! 0) $ [a | a <- [2..intSqrt n], rem n a == 0] ++ [n]


factorize2 :: Int -> [(Int,Int)]
factorize2 = foldl (\ls@((val,freq):xs) y -> if val == y then (val,freq+1):xs else (y,1):ls) [(0,0)] . factorize


numDivisors :: Int -> Int
numDivisors = foldl (\acc (_,y) -> acc * (y+1)) 1 <$> factorize2


nextTriangleNumber :: (Int,Int) -> (Int,Int)
nextTriangleNumber (n,acc) = (n+1,acc+n+1)


forward :: Int -> (Int, Int) -> Either (Int, Int) (Int, Int)
forward k val@(n,acc) = if numDivisors acc > k then Left val else Right (nextTriangleNumber val)


problem12 :: Int -> (Int, Int)
problem12 n = (!!0) . lefts . scanl (>>=) (forward n (1,1)) . repeat . forward $ n


main = do
let (n,val) = problem12 1000
print val

使用ghc -O3，在我的机器（1.73GHz Core i7）上始终运行0.55-0.58秒。

C版本的更有效的factorCount函数：

int factorCount (int n)
{
int count = 1;
int candidate,tmpCount;
while (n % 2 == 0) {
count++;
n /= 2;
}
for (candidate = 3; candidate < n && candidate * candidate < n; candidate += 2)
if (n % candidate == 0) {
tmpCount = 1;
do {
tmpCount++;
n /= candidate;
} while (n % candidate == 0);
count*=tmpCount;
}
if (n > 1)
count *= 2;
return count;
}

使用gcc -O3 -lm将long s更改为main中的int，这始终在0.31-0.35秒内运行。

如果你利用第n个三角形数=n*（n+1）/2的事实，并且n和（n+1）具有完全不同的素数因式分解，那么两者都可以运行得更快，因此每一半的因数数量可以相乘以找到整体的因数数量。以下内容：

int main ()
{
int triangle = 0,count1,count2 = 1;
do {
count1 = count2;
count2 = ++triangle % 2 == 0 ? factorCount(triangle+1) : factorCount((triangle+1)/2);
} while (count1*count2 < 1001);
printf ("%lld\n", ((long long)triangle)*(triangle+1)/2);
}

将c代码的运行时间减少到0.17-0.19秒，并且它可以处理更大的搜索——大于10000个因子在我的机器上大约需要43秒。我给感兴趣的读者留下了类似的haskell加速。

小开

我假设只有当涉及的数字有许多小因子时，因子的数量才会很大。所以我使用了托姆基德的优秀算法，但首先使用了一个永远不会太小的因子计数的近似值。这很简单：检查最多29个素数因子，然后检查剩余的数字，计算因子数量的上限。用它来计算因子数量的上限，如果这个数字足够高，计算因子的确切数量。

下面的代码不需要这个假设来保证正确性，但要快速。它似乎有效；只有大约十万分之一的数字给出了足够高的估计值，需要进行全面检查。

以下是代码：

// Return at least the number of factors of n.
static uint64_t approxfactorcount (uint64_t n)
{
uint64_t count = 1, add;


#define CHECK(d)                            \
do {                                    \
if (n % d == 0) {                   \
add = count;                    \
do { n /= d; count += add; }    \
while (n % d == 0);             \
}                                   \
} while (0)


CHECK ( 2); CHECK ( 3); CHECK ( 5); CHECK ( 7); CHECK (11); CHECK (13);
CHECK (17); CHECK (19); CHECK (23); CHECK (29);
if (n == 1) return count;
if (n < 1ull * 31 * 31) return count * 2;
if (n < 1ull * 31 * 31 * 37) return count * 4;
if (n < 1ull * 31 * 31 * 37 * 37) return count * 8;
if (n < 1ull * 31 * 31 * 37 * 37 * 41) return count * 16;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43) return count * 32;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47) return count * 64;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53) return count * 128;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59) return count * 256;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61) return count * 512;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67) return count * 1024;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71) return count * 2048;
if (n < 1ull * 31 * 31 * 37 * 37 * 41 * 43 * 47 * 53 * 59 * 61 * 67 * 71 * 73) return count * 4096;
return count * 1000000;
}


// Return the number of factors of n.
static uint64_t factorcount (uint64_t n)
{
uint64_t count = 1, add;


CHECK (2); CHECK (3);


uint64_t d = 5, inc = 2;
for (; d*d <= n; d += inc, inc = (6 - inc))
CHECK (d);


if (n > 1) count *= 2; // n must be a prime number
return count;
}


// Prints triangular numbers with record numbers of factors.
static void printrecordnumbers (uint64_t limit)
{
uint64_t record = 30000;


uint64_t count1, factor1;
uint64_t count2 = 1, factor2 = 1;


for (uint64_t n = 1; n <= limit; ++n)
{
factor1 = factor2;
count1 = count2;


factor2 = n + 1; if (factor2 % 2 == 0) factor2 /= 2;
count2 = approxfactorcount (factor2);


if (count1 * count2 > record)
{
uint64_t factors = factorcount (factor1) * factorcount (factor2);
if (factors > record)
{
printf ("%lluth triangular number = %llu has %llu factors\n", n, factor1 * factor2, factors);
record = factors;
}
}
}
}

这在大约0.7秒内找到了具有13824个因数的第14,753,024个三角形，在34秒内找到了具有61,440个因数的879,207,615个三角形数，在10分5秒内找到了具有138,240个因数的12,524,486,975个三角形数，在21分25秒内找到了具有172,032个因数的26,467,792,064个三角形数（2.4GHz Core2 Duo），因此此代码平均每个数字只需116个处理器周期。最后一个三角形数本身大于2^68，因此

小开

C++11，<20ms-在这里运行

我知道你想要一些技巧来帮助提高你的语言特定知识，但由于这里已经很好地介绍了这一点，我想我会为那些可能看过你的问题的数学评论的人添加一些上下文，等等，想知道为什么这段代码这么慢。

这个答案主要是提供上下文，希望帮助人们更容易地评估你的问题/其他答案中的代码。

这段代码只使用了几个（粗俗的）优化，与所使用的语言无关，基于：

每个Traingle数的形式为n（n+1）/2
n和n+1是互质的
除数的个数是一个乘法函数

#include <iostream>
#include <cmath>
#include <tuple>
#include <chrono>


using namespace std;


// Calculates the divisors of an integer by determining its prime factorisation.


int get_divisors(long long n)
{
int divisors_count = 1;


for(long long i = 2;
i <= sqrt(n);
/* empty */)
{
int divisions = 0;
while(n % i == 0)
{
n /= i;
divisions++;
}


divisors_count *= (divisions + 1);


//here, we try to iterate more efficiently by skipping
//obvious non-primes like 4, 6, etc
if(i == 2)
i++;
else
i += 2;
}


if(n != 1) //n is a prime
return divisors_count * 2;
else
return divisors_count;
}


long long euler12()
{
//n and n + 1
long long n, n_p_1;


n = 1; n_p_1 = 2;


// divisors_x will store either the divisors of x or x/2
// (the later iff x is divisible by two)
long long divisors_n = 1;
long long divisors_n_p_1 = 2;


for(;;)
{
/* This loop has been unwound, so two iterations are completed at a time
* n and n + 1 have no prime factors in common and therefore we can
* calculate their divisors separately
*/


long long total_divisors;                 //the divisors of the triangle number
// n(n+1)/2


//the first (unwound) iteration


divisors_n_p_1 = get_divisors(n_p_1 / 2); //here n+1 is even and we


total_divisors =
divisors_n
* divisors_n_p_1;


if(total_divisors > 1000)
break;


//move n and n+1 forward
n = n_p_1;
n_p_1 = n + 1;


//fix the divisors
divisors_n = divisors_n_p_1;
divisors_n_p_1 = get_divisors(n_p_1);   //n_p_1 is now odd!


//now the second (unwound) iteration


total_divisors =
divisors_n
* divisors_n_p_1;


if(total_divisors > 1000)
break;


//move n and n+1 forward
n = n_p_1;
n_p_1 = n + 1;


//fix the divisors
divisors_n = divisors_n_p_1;
divisors_n_p_1 = get_divisors(n_p_1 / 2);   //n_p_1 is now even!
}


return (n * n_p_1) / 2;
}


int main()
{
for(int i = 0; i < 1000; i++)
{
using namespace std::chrono;
auto start = high_resolution_clock::now();
auto result = euler12();
auto end = high_resolution_clock::now();


double time_elapsed = duration_cast<milliseconds>(end - start).count();


cout << result << " " << time_elapsed << '\n';
}
return 0;
}

我的台式机平均需要19毫秒，笔记本电脑平均需要80毫秒，这与我在这里看到的大多数其他代码相去甚远。

小开

更多关于C版本的数字和解释。显然这些年来没有人这样做。记得给这个答案投赞成票，这样每个人都可以看到和学习。

第一步：确定作者方案的基准

笔记本电脑规格：

CPU i3 M380（931 MHz-最大省电模式）
4GB内存
win7 64位
微软Visual Studio 2012终极版
Cygwin与gcc 4.9.3
python2.7.10

命令：

compiling on VS x64 command prompt > `for /f %f in ('dir /b *.c') do cl /O2 /Ot /Ox %f -o %f_x64_vs2012.exe`
compiling on cygwin with gcc x64   > `for f in ./*.c; do gcc -m64 -O3 $f -o ${f}_x64_gcc.exe ; done`
time (unix tools) using cygwin > `for f in ./*.exe; do  echo "----------"; echo $f ; time $f ; done`

.

----------
$ time python ./original.py


real    2m17.748s
user    2m15.783s
sys     0m0.093s
----------
$ time ./original_x86_vs2012.exe


real    0m8.377s
user    0m0.015s
sys     0m0.000s
----------
$ time ./original_x64_vs2012.exe


real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./original_x64_gcc.exe


real    0m20.951s
user    0m20.732s
sys     0m0.030s

文件名：integertype_architecture_compiler.exe

参数类型目前与原始程序相同（稍后会详细介绍）
架构是x86还是x64取决于编译器设置
编译器是gcc还是vs2012

第二步：再次调查、改进和基准测试

VS比gcc快250%。两个编译器应该给出类似的速度。显然，代码或编译器选项有问题。让我们调查一下！

第一个兴趣点是整数类型。转换可能很昂贵，一致性对于更好的代码生成和优化很重要。所有整数都应该是相同的类型。

现在是int和long的混合混乱。我们将改进它。使用什么类型？最快的。必须对它们进行基准测试！

----------
$ time ./int_x86_vs2012.exe


real    0m8.440s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int_x64_vs2012.exe


real    0m8.408s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int32_x86_vs2012.exe


real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int32_x64_vs2012.exe


real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x86_vs2012.exe


real    0m18.112s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x64_vs2012.exe


real    0m18.611s
user    0m0.000s
sys     0m0.015s
----------
$ time ./long_x86_vs2012.exe


real    0m8.393s
user    0m0.015s
sys     0m0.000s
----------
$ time ./long_x64_vs2012.exe


real    0m8.440s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x86_vs2012.exe


real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x64_vs2012.exe


real    0m8.393s
user    0m0.015s
sys     0m0.015s
----------
$ time ./uint64_x86_vs2012.exe


real    0m15.428s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint64_x64_vs2012.exe


real    0m15.725s
user    0m0.015s
sys     0m0.015s
----------
$ time ./int_x64_gcc.exe


real    0m8.531s
user    0m8.329s
sys     0m0.015s
----------
$ time ./int32_x64_gcc.exe


real    0m8.471s
user    0m8.345s
sys     0m0.000s
----------
$ time ./int64_x64_gcc.exe


real    0m20.264s
user    0m20.186s
sys     0m0.015s
----------
$ time ./long_x64_gcc.exe


real    0m20.935s
user    0m20.809s
sys     0m0.015s
----------
$ time ./uint32_x64_gcc.exe


real    0m8.393s
user    0m8.346s
sys     0m0.015s
----------
$ time ./uint64_x64_gcc.exe


real    0m16.973s
user    0m16.879s
sys     0m0.030s

整数类型是intlongint32_tuint32_tint64_t和uint64_t从#include <stdint.h>

C中有很多整数类型，还有一些有符号/无符号可供使用，还有编译为x86或x64的选择（不要与实际的整数大小混淆）。那是很多版本要编译和运行^^

第三步：理解数字

明确结论：

32位整数比64位整数快约200%
无符号64位整数比64位签名快25%（不幸的是，我对此没有解释）

技巧问题：“C中int和long的大小是多少？”
正确答案：C中int和long的大小没有明确定义！

从C规范：

int至少32位
long至少是一个int

从gcc手册页（-m32和-m64标志）：

32位环境将int、long和指针设置为32位，并生成在任何i386系统上运行的代码。
64位环境将int设置为32位，long和指针设置为64位，并为AMD的x86-64架构生成代码。

从MSDN留档（数据类型范围）https://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.110%29.aspx：

int，4个字节，也称为有符号
long，4字节，也称为long int和有符号long int

总结一下：经验教训

32位整数比64位整数快。

标准整数类型在C和C++中没有很好的定义，它们因编译器和架构而异。当您需要一致性和可预测性时，请使用#include <stdint.h>中的uint32_t整数系列。

速度问题解决了。所有其他语言都落后了百分之一百，C&C++又赢了！他们总是这样。下一个改进将是使用OpenMP： D的多线程

小开

尝试去：

package main


import "fmt"
import "math"


func main() {
var n, m, c int
for i := 1; ; i++ {
n, m, c = i * (i + 1) / 2, int(math.Sqrt(float64(n))), 0
for f := 1; f < m; f++ {
if n % f == 0 { c++ }
}
c *= 2
if m * m == n { c ++ }
if c > 1001 {
fmt.Println(n)
break
}
}
}

我得到：

原c版本：9.1690百分之百
go： 8.2520111%

但使用：

package main


import (
"math"
"fmt"
)


// Sieve of Eratosthenes
func PrimesBelow(limit int) []int {
switch {
case limit < 2:
return []int{}
case limit == 2:
return []int{2}
}
sievebound := (limit - 1) / 2
sieve := make([]bool, sievebound+1)
crosslimit := int(math.Sqrt(float64(limit))-1) / 2
for i := 1; i <= crosslimit; i++ {
if !sieve[i] {
for j := 2 * i * (i + 1); j <= sievebound; j += 2*i + 1 {
sieve[j] = true
}
}
}
plimit := int(1.3*float64(limit)) / int(math.Log(float64(limit)))
primes := make([]int, plimit)
p := 1
primes[0] = 2
for i := 1; i <= sievebound; i++ {
if !sieve[i] {
primes[p] = 2*i + 1
p++
if p >= plimit {
break
}
}
}
last := len(primes) - 1
for i := last; i > 0; i-- {
if primes[i] != 0 {
break
}
last = i
}
return primes[0:last]
}






func main() {
fmt.Println(p12())
}
// Requires PrimesBelow from utils.go
func p12() int {
n, dn, cnt := 3, 2, 0
primearray := PrimesBelow(1000000)
for cnt <= 1001 {
n++
n1 := n
if n1%2 == 0 {
n1 /= 2
}
dn1 := 1
for i := 0; i < len(primearray); i++ {
if primearray[i]*primearray[i] > n1 {
dn1 *= 2
break
}
exponent := 1
for n1%primearray[i] == 0 {
exponent++
n1 /= primearray[i]
}
if exponent > 1 {
dn1 *= exponent
}
if n1 == 1 {
break
}
}
cnt = dn * dn1
dn = dn1
}
return n * (n - 1) / 2
}

我得到：

原c版本：9.1690百分之百
ThaumKids的c版本：0.10608650%
First go版本：8.2520111%
第二个go版本：0.023039865%

我还尝试了Python3.6和pypy3.3-5.5-alpha：

原c版本：8.629百分之百
ThaumKids的c版本：0.1097916%
Python3.6:54.79516%
pypy3.3-5.5-alpha： 13.29165%

然后使用以下代码我得到了：

原c版本：8.629百分之百
ThaumKids的c版本：0.1098650%
Python3.6:1.489580%
pypy3.3-5.5-alpha： 0.5821483%

def D(N):
if N == 1: return 1
sqrtN = int(N ** 0.5)
nf = 1
for d in range(2, sqrtN + 1):
if N % d == 0:
nf = nf + 1
return 2 * nf - (1 if sqrtN**2 == N else 0)


L = 1000
Dt, n = 0, 0


while Dt <= L:
t = n * (n + 1) // 2
Dt = D(n/2)*D(n+1) if n%2 == 0 else D(n)*D((n+1)/2)
n = n + 1


print (t)