The key to parallelism is using multiple threads to solve a problem (no surprise there), but threads on a GPU are organized quite differently from classical multithreaded programming.
First, let's talk about your typical GPU. For simplicity's sake I'll focus on NVidia hardware and its terminology.
A GPU has many processing cores, which makes it ideal for executing many threads in parallel. Those cores are grouped into Streaming Multiprocessors (SMs, an NVidia term), of which a GPU has a given number.
Threads are grouped into 'thread blocks'; each block runs on a single SM, and a block can contain more threads than the SM has cores. The number of cores per SM corresponds to the so-called 'warp size' (NVidia term), and the threads inside a thread block are scheduled in 'warps' of that size.
A quick example to follow up: a typical NVidia SM has 32 processing cores, so its warp size is 32. If my thread block has 128 threads to run, they will be scheduled as 4 warps (4 warps * 32 threads per warp = 128 threads).
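The block/warp arithmetic above can be sketched in CUDA. This is a minimal illustration, not production code; the kernel and variable names are made up, but `warpSize`, `threadIdx`, and `blockDim` are real CUDA built-ins:

```cuda
#include <cstdio>

// Each thread computes which warp it belongs to inside its block.
// With blockDim.x == 128 and warpSize == 32, warp IDs run from 0 to 3.
__global__ void warpDemo(int *warpIds) {
    int tid    = threadIdx.x;
    int warpId = tid / warpSize;   // warpSize is a CUDA built-in (32 on NVidia GPUs)
    warpIds[tid] = warpId;
}

int main() {
    int *d_warpIds;
    cudaMalloc(&d_warpIds, 128 * sizeof(int));
    warpDemo<<<1, 128>>>(d_warpIds);   // 1 block of 128 threads = 4 warps
    cudaDeviceSynchronize();

    int h_warpIds[128];
    cudaMemcpy(h_warpIds, d_warpIds, sizeof(h_warpIds), cudaMemcpyDeviceToHost);
    printf("thread 0 is in warp %d, thread 127 is in warp %d\n",
           h_warpIds[0], h_warpIds[127]);   // warps 0 and 3
    cudaFree(d_warpIds);
    return 0;
}
```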
The warp size is rather important when choosing the number of threads later on.
All threads inside a single warp share a single instruction counter. That means those 32 threads are truly synchronized: every thread executes the same instruction at the same time. Here lies a performance pitfall: this also applies to branching statements in your kernel!
Example: my kernel contains an if statement with two branches. Inside one warp, 16 of my threads take branch one and the other 16 take branch two. Up until the if statement, all threads in the warp are in sync. Now half of them choose a different branch. What happens is that one half lies dormant while the first 16 threads execute their branch; then those 16 go dormant until the other 16 have finished theirs.
As you can see, bad branching habits can severely slow down your parallel code, because in the worst case both branches get executed one after the other. If all threads inside a warp agree on a single branch, the other one is skipped entirely and no delay occurs.
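The divergence scenario above can be sketched like this; the kernels are hypothetical and only exist to show the branching patterns:

```cuda
// BAD: the condition splits every warp in half, so each warp executes
// both branches one after the other while half its threads sit idle.
__global__ void divergentKernel(float *out) {
    int lane = threadIdx.x % warpSize;   // position within the warp, 0..31
    if (lane < 16) {
        out[threadIdx.x] = threadIdx.x * 2.0f;
    } else {
        out[threadIdx.x] = threadIdx.x * 3.0f;
    }
}

// BETTER: the condition is identical for all 32 threads of a warp,
// so each warp takes exactly one branch and skips the other.
__global__ void uniformKernel(float *out) {
    int warpId = threadIdx.x / warpSize;
    if (warpId % 2 == 0) {
        out[threadIdx.x] = threadIdx.x * 2.0f;
    } else {
        out[threadIdx.x] = threadIdx.x * 3.0f;
    }
}
```

Both kernels compute the same kind of result; only the shape of the condition relative to warp boundaries differs.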
Syncing threads is also not a simple matter. From inside a kernel you can only sync the threads within a single thread block (which lives on a single SM). Everything outside that block cannot be synchronized from inside the kernel; for a GPU-wide sync point you have to write separate kernels and launch them one after the other.
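A minimal sketch of that pattern, with made-up kernel names; `__syncthreads()` is the real CUDA barrier for threads within one block:

```cuda
// Hypothetical two-phase computation split across kernel launches.
__global__ void phaseOne(float *data) {
    // ... each block works on its own part of data ...
    __syncthreads();   // syncs only the threads of THIS block, nothing more
}

__global__ void phaseTwo(float *data) {
    // ... may safely read results written by ANY block in phaseOne ...
}

void runPipeline(float *d_data) {
    phaseOne<<<64, 128>>>(d_data);
    // Kernel launches on the same stream execute in order, so phaseTwo
    // only starts after ALL blocks of phaseOne have finished: the launch
    // boundary acts as the GPU-wide synchronization point.
    phaseTwo<<<64, 128>>>(d_data);
    cudaDeviceSynchronize();   // wait for the GPU from the host side
}
```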