In a typical setup, both the source data and the processed results live in global memory. When a thread block starts, it first copies the parts it needs into shared memory; each thread then loads the elements it works on from shared memory into registers.
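As a rough illustration of this global, shared, register staging, here is a minimal sketch of a 3-point averaging kernel. The names `smooth3`, `smooth3_host` and `TILE`, the block size of 256 and the zero padding at the array ends are all assumptions made for this example, not something prescribed by CUDA.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // threads per block (assumed value for this sketch)

// Each block stages its slice of `in` (plus one neighbour on each side)
// in shared memory; each thread then works from registers.
__global__ void smooth3(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index in global memory
    int lid = threadIdx.x + 1;                        // index in the shared tile

    // Step 1: global -> shared. Out-of-range elements are treated as 0.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Step 2: shared -> registers, compute, write the result back to global.
    if (gid < n) {
        float left  = tile[lid - 1];
        float mid   = tile[lid];
        float right = tile[lid + 1];
        out[gid] = (left + mid + right) / 3.0f;
    }
}

// Hypothetical host wrapper: copies the data to the device, runs the kernel
// with TILE threads per block, and copies the result back.
std::vector<float> smooth3_host(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    smooth3<<<blocks, TILE>>>(d_in, d_out, n);

    // The blocking copy also reports errors from the earlier calls.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The staging pays off here because every input element is read by three threads, but fetched from global memory only once per block (apart from the two border elements).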
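Memory access latency also depends on your access pattern. If threads read and write without regard to how the memory is organized, the hardware has to serialize the requests or split them into many separate transactions, and performance drops sharply.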
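Shared memory is organized in so-called 'banks'. Each bank can serve one request per clock cycle, so requests that go to different banks are handled in parallel, while several threads of a warp addressing different words in the same bank cause a 'bank conflict' and are served one after another. The number of banks in the shared memory equals the warp size (32 on devices of compute capability 2.0 and later), and successive 32-bit words are assigned to successive banks. You can therefore speed up shared memory access by avoiding conflicting bank accesses inside a single warp.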
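One common way to avoid such conflicts is to pad a shared array by one column. The sketch below shows this for a tiled matrix transpose; it assumes 32 banks of 32-bit words and a 32x32 thread block, and the names `transpose_tile`, `transpose_host` and `DIM` are made up for this example.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int DIM = 32;  // tile edge = warp size = assumed number of banks

// Transposes a row-major height x width matrix into a width x height one.
__global__ void transpose_tile(const float* in, float* out, int width, int height)
{
    // The extra column shifts each row by one bank: element (r, c) lands in
    // bank (r * 33 + c) % 32, so a warp reading a column hits 32 different
    // banks. With tile[DIM][DIM] the whole column would sit in one bank.
    __shared__ float tile[DIM][DIM + 1];

    int x = blockIdx.x * DIM + threadIdx.x;
    int y = blockIdx.y * DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // row-wise write
    __syncthreads();

    // Swap the block coordinates for the output position.
    x = blockIdx.y * DIM + threadIdx.x;  // column in `out`
    y = blockIdx.x * DIM + threadIdx.y;  // row in `out`
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read
}

// Hypothetical host wrapper around the kernel.
std::vector<float> transpose_host(const std::vector<float>& h_in, int width, int height)
{
    size_t bytes = static_cast<size_t>(width) * height * sizeof(float);
    std::vector<float> h_out(static_cast<size_t>(width) * height);
    if (bytes == 0) return h_out;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(DIM, DIM);
    dim3 grid((width + DIM - 1) / DIM, (height + DIM - 1) / DIM);
    transpose_tile<<<grid, block>>>(d_in, d_out, width, height);

    // The blocking copy also reports errors from the earlier calls.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The padding costs 32 extra floats of shared memory per block, which is usually a small price for conflict-free column reads.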
The fastest way to copy data between global and shared memory is to 'align' your memory accesses, usually called 'coalescing'. The first thread in a warp should access the first element of an aligned segment, the second thread the second element, and so on, in both shared and global memory. The hardware then combines the accesses of the whole warp into as few memory transactions as possible, ideally a single one, instead of issuing one transaction per thread.
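The access pattern described above can be sketched as follows, for the case where each thread handles several elements. The names `copy_coalesced`, `copy_host`, `THREADS`, `PER_THREAD` and `CHUNK` and their values are assumptions chosen for illustration only.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS    = 128;                   // threads per block (assumed)
constexpr int PER_THREAD = 4;                     // elements per thread (assumed)
constexpr int CHUNK      = THREADS * PER_THREAD;  // elements per block

// Copies `src` to `dst` through shared memory.
__global__ void copy_coalesced(const float* src, float* dst, int n)
{
    __shared__ float buf[CHUNK];
    int base = blockIdx.x * CHUNK;

    // In iteration k, thread t touches element k * THREADS + t. The 32 threads
    // of a warp therefore read 32 consecutive floats from global memory and
    // write 32 consecutive shared words, one per bank.
    // Indexing with t * PER_THREAD + k instead would make neighbouring threads
    // access addresses PER_THREAD elements apart, which cannot be combined
    // as well.
    for (int k = 0; k < PER_THREAD; ++k) {
        int i = k * THREADS + threadIdx.x;
        if (base + i < n) buf[i] = src[base + i];
    }
    __syncthreads();  // required once threads read elements loaded by others

    // Same pattern on the way back: shared -> global.
    for (int k = 0; k < PER_THREAD; ++k) {
        int i = k * THREADS + threadIdx.x;
        if (base + i < n) dst[base + i] = buf[i];
    }
}

// Hypothetical host wrapper around the kernel.
std::vector<float> copy_host(const std::vector<float>& h_src)
{
    int n = static_cast<int>(h_src.size());
    std::vector<float> h_dst(n);
    if (n == 0) return h_dst;

    float *d_src = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, n * sizeof(float));
    cudaMalloc(&d_dst, n * sizeof(float));
    cudaMemcpy(d_src, h_src.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + CHUNK - 1) / CHUNK;
    copy_coalesced<<<blocks, THREADS>>>(d_src, d_dst, n);

    // The blocking copy also reports errors from the earlier calls.
    cudaError_t err = cudaMemcpy(h_dst.data(), d_dst, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;

    cudaFree(d_src);
    cudaFree(d_dst);
    return h_dst;
}
```

Memory returned by `cudaMalloc` is aligned generously enough that each block's chunk starts on a segment boundary here, so the alignment part of the rule is satisfied as well.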