On a 8 cores Linux machine using GCC version 4.4, the C codes can be compiled and run the following way:
$ gcc -std=c99 -O3 -fopenmp loop.c -o loopc -lm
$ OMP_NUM_THREADS=1 ./loopc
Computing 1000000 cosines and summing them with 1 threads took 0.095832s
$ OMP_NUM_THREADS=2 ./loopc
Computing 1000000 cosines and summing them with 2 threads took 0.047637s
$ OMP_NUM_THREADS=4 ./loopc
Computing 1000000 cosines and summing them with 4 threads took 0.024498s
$ OMP_NUM_THREADS=8 ./loopc
Computing 1000000 cosines and summing them with 8 threads took 0.011785s
For the Fortran version, it gives:
$ gfortran -O3 -fopenmp loop.f90 -o loopf
$ OMP_NUM_THREADS=1 ./loopf
Computing 1000000 cosines and summing them with 1 threads took 0.0915s
$ OMP_NUM_THREADS=2 ./loopf
Computing 1000000 cosines and summing them with 2 threads took 0.0472s
$ OMP_NUM_THREADS=4 ./loopf
Computing 1000000 cosines and summing them with 4 threads took 0.0236s
$ OMP_NUM_THREADS=8 ./loopf
Computing 1000000 cosines and summing them with 8 threads took 0.0118s