NAME : Merdeka Putra
NPM : 56414592
CLASS : 4IA22
GROUP : 5
TOPIC : Parallel Computing
Parallel computing is a type of computing architecture in which several processors execute or process an application or computation simultaneously. Parallel computing helps in performing large computations by dividing the workload between more than one processor, all of which work through the computation at the same time. Most supercomputers employ parallel computing principles to operate.
Parallel computing is also known as parallel processing.
1. von Neumann Architecture
- Named after the Hungarian mathematician/genius John von Neumann who first authored the general requirements for an electronic computer in his 1945 papers.
- Also known as "stored-program computer" - both program instructions and data are kept in electronic memory. Differs from earlier computers which were programmed through "hard wiring".
- Since then, virtually all computers have followed this basic design:
- Comprised of four main components:
- Memory
- Control Unit
- Arithmetic Logic Unit
- Input/Output
- Read/write, random access memory is used to store both program instructions and data
- Program instructions are coded data which tell the computer to do something
- Data is simply information to be used by the program
- Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
- Arithmetic Logic Unit performs basic arithmetic and logic operations
- Input/Output is the interface to the human operator
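The stored-program idea described above can be illustrated with a very small sketch. The instruction set (LOAD, ADD, STORE, HALT), the memory layout and the function name `run` are all invented for this illustration and do not correspond to any real machine. The point is only that instructions and data sit in the same memory, and that a control loop fetches, decodes and executes them one after another.

```python
# Toy stored-program machine: instructions and data share one memory list.
# The opcodes below are hypothetical and exist only for this illustration.

def run(memory):
    """Fetch, decode and execute instructions until HALT; return the memory."""
    pc = 0    # program counter, owned by the control unit
    acc = 0   # accumulator, the register the arithmetic logic unit works on
    while True:
        opcode, operand = memory[pc]   # fetch the next instruction
        pc += 1
        if opcode == "LOAD":           # decode, then execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc = acc + memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")

program = [
    ("LOAD", 4),    # address 0: copy memory[4] into the accumulator
    ("ADD", 5),     # address 1: add memory[5] to the accumulator
    ("STORE", 6),   # address 2: write the accumulator to memory[6]
    ("HALT", 0),    # address 3: stop
    2,              # address 4: data
    3,              # address 5: data
    0,              # address 6: result goes here
]

result = run(list(program))
print(result[6])  # 5
```

Because the program is just data in memory, changing the tuples at addresses 0 to 3 changes what the machine does without any rewiring.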
2. Flynn's Classical Taxonomy
- There are different ways to classify parallel computers.
- One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.
- Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction Stream and Data Stream. Each of these dimensions can have only one of two possible states: Single or Multiple.
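- The four possible classifications according to Flynn are:
- SISD: Single Instruction stream, Single Data stream
- SIMD: Single Instruction stream, Multiple Data streams
- MISD: Multiple Instruction streams, Single Data stream
- MIMD: Multiple Instruction streams, Multiple Data streams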
3. Some General Parallel Terminology
- Like everything else, parallel computing has its own "jargon". Some of the more commonly used terms associated with parallel computing are listed below.
- Most of these will be discussed in more detail later.
Supercomputing / High Performance Computing (HPC)
Using the world's fastest and largest computers to solve large problems.
Node
A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores,
memory, network interfaces, etc. Nodes are networked together to comprise a
supercomputer.
CPU / Socket / Processor / Core
This varies, depending upon who you talk to. In the past, a CPU (Central Processing Unit)
was a singular execution component for a computer. Then, multiple CPUs were
incorporated into a node. Then, individual CPUs were subdivided into multiple "cores",
each being a unique execution unit. CPUs with multiple cores are sometimes called
"sockets" - vendor dependent. The result is a node with multiple CPUs, each containing
multiple cores. The nomenclature is confused at times. Wonder why?
Task
A logically discrete section of computational work. A task is typically a program or
program-like set of instructions that is executed by a processor. A parallel program consists
of multiple tasks running on multiple processors.
Pipelining
Breaking a task into steps performed by different processor units, with inputs streaming
through, much like an assembly line; a type of parallel computing.
Shared Memory
From a strictly hardware point of view, describes a computer architecture where all
processors have direct (usually bus based) access to common physical memory. In a
programming sense, it describes a model where parallel tasks all have the same "picture" of
memory and can directly address and access the same logical memory locations regardless of
where the physical memory actually exists.
Symmetric Multi-Processor (SMP)
Shared memory hardware architecture where multiple processors share a single address space
and have equal access to all resources.
Distributed Memory
In hardware, refers to network based memory access for physical memory that is not
common. As a programming model, tasks can only logically "see" local machine memory
and must use communications to access memory on other machines where other tasks are
executing.
Communications
Parallel tasks typically need to exchange data. There are several ways this can be
accomplished, such as through a shared memory bus or over a network. However, the actual
event of data exchange is commonly referred to as communications regardless of the method
employed.
Synchronization
The coordination of parallel tasks in real time, very often associated with communications.
Often implemented by establishing a synchronization point within an application where a
task may not proceed further until another task(s) reaches the same or logically equivalent
point.
Synchronization usually involves waiting by at least one task, and can therefore cause a
parallel application's wall clock execution time to increase.
Granularity
In parallel computing, granularity is a qualitative measure of the ratio of computation to
communication.
Coarse: relatively large amounts of computational work are done between communication
events
Fine: relatively small amounts of computational work are done between communication
events
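Observed Speedup
The observed speedup of a code that has been parallelized, defined as the wall-clock time of
serial execution divided by the wall-clock time of parallel execution.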
Parallel Overhead
The amount of time required to coordinate parallel tasks, as opposed to doing useful work.
Parallel overhead can include factors such as:
- Task start-up time
- Synchronizations
- Data communications
- Software overhead imposed by parallel languages, libraries, operating system, etc.
- Task termination time
Massively Parallel
Refers to the hardware that comprises a given parallel system - having many processing
elements. The meaning of "many" keeps increasing, but currently, the largest parallel
computers are comprised of processing elements numbering in the hundreds of thousands to
millions.
Embarrassingly Parallel
Solving many similar, but independent tasks simultaneously; little to no need for coordination
between the tasks.
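As a rough sketch of an embarrassingly parallel workload, the example below applies the same function to independent inputs using a worker pool from Python's standard library. The function names `square` and `parallel_squares` are made up for this illustration. Threads are used only to keep the sketch self-contained; for CPU-bound work in standard CPython a process pool would typically be chosen instead.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """One independent task: it needs no data from any other task."""
    return n * n

def parallel_squares(values, workers=4):
    """Run square() on every value using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands tasks to the workers and returns results in input order
        return list(pool.map(square, values))

squares = parallel_squares(range(8))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

No task waits on another, so the only coordination is handing out the inputs and collecting the results.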
Scalability
Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate
increase in parallel speedup with the addition of more resources. Factors that contribute to
scalability include:
- Hardware - particularly memory-CPU bandwidths and network communication properties
- Application algorithm
- Parallel overhead
- Characteristics of your specific application
4. Limits and Costs of Parallel Programming
Amdahl's Law:
- Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized: speedup = 1 / (1 - P).
- If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup).
- If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).
- If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.
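- Introducing the number of processors (N) performing the parallel fraction of work, the relationship can be modeled by speedup = 1 / (P/N + S), where P is the parallel fraction, N is the number of processors and S = 1 - P is the serial fraction.
- It soon becomes obvious that there are limits to the scalability of parallelism. For example, with P = 0.50 the speedup approaches but never exceeds 2, however many processors are added.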
- However, certain problems demonstrate increased performance by increasing the problem size.
- Problems that increase the percentage of parallel time with their size are more scalable than problems with a fixed percentage of parallel time.
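The limits described by Amdahl's Law can be explored numerically. The sketch below assumes the usual form of the law, speedup = 1 / (P/N + S) with S = 1 - P, where P is the parallel fraction and N the number of processors. The function names are chosen for this illustration only.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's Law for N processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / processors + serial_fraction)

def max_speedup(parallel_fraction):
    """Upper bound on speedup as the number of processors grows without limit."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)

# Print one row per processor count for three parallel fractions.
for n in (10, 100, 1000, 10000):
    row = [round(amdahl_speedup(p, n), 2) for p in (0.50, 0.90, 0.99)]
    print(n, row)
```

Each column of the printed table levels off near `max_speedup(p)` for its parallel fraction, which is why adding processors eventually stops paying off when the serial fraction is fixed.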
Complexity:
- In general, parallel applications are much more complex than corresponding serial applications, perhaps an order of magnitude. Not only do you have multiple instruction streams executing at the same time, but you also have data flowing between them.
- The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle:
- Design
- Coding
- Debugging
- Tuning
- Maintenance
- Adhering to "good" software development practices is essential when working with parallel applications - especially if somebody besides you will have to work with the software.
Portability:
- Thanks to standardization in several APIs, such as MPI, POSIX threads, and OpenMP, portability issues with parallel programs are not as serious as in years past. However...
- All of the usual portability issues associated with serial programs apply to parallel programs. For example, if you use vendor "enhancements" to Fortran, C or C++, portability will be a problem.
- Even though standards exist for several APIs, implementations will differ in a number of details, sometimes to the point of requiring code modifications in order to effect portability.
- Operating systems can play a key role in code portability issues.
- Hardware architectures are characteristically highly variable and can affect portability.
Resource Requirements:
- The primary intent of parallel programming is to decrease execution wall clock time. However, to accomplish this, more CPU time is required. For example, a parallel code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time.
- The amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and for overheads associated with parallel support libraries and subsystems.
- For short running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation. The overhead costs associated with setting up the parallel environment, task creation, communications and task termination can comprise a significant portion of the total execution time for short runs.
Scalability:
- Two types of scaling based on time to solution: strong scaling and weak scaling.
- Strong scaling:
- The total problem size stays fixed as more processors are added.
- Goal is to run the same problem size faster
- Perfect scaling means the problem is solved in 1/P of the serial time, where P is the number of processors
- Weak scaling:
- The problem size per processor stays fixed as more processors are added.
- Goal is to run larger problem in same amount of time
- Perfect scaling means a problem P times larger, run on P processors, takes the same time as the single processor run
- The ability of a parallel program's performance to scale is a result of a number of interrelated factors. Simply adding more processors is rarely the answer.
- The algorithm may have inherent limits to scalability. At some point, adding more resources causes performance to decrease. This is a common situation with many parallel applications.
- Hardware factors play a significant role in scalability. Examples:
- Memory-CPU bus bandwidth on an SMP machine
- Communications network bandwidth
- Amount of memory available on any given machine or set of machines
- Processor clock speed
- Parallel support libraries and subsystems software can limit scalability independent of your application.
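Strong and weak scaling are often summarized as an efficiency figure between 0 and 1, where 1 means perfect scaling. The sketch below uses the common definitions: for strong scaling, the serial time divided by (processors x parallel time); for weak scaling, the single-processor time divided by the parallel time. The timings passed in at the end are made-up numbers for illustration, not measurements.

```python
def strong_scaling_efficiency(t_serial, t_parallel, processors):
    """Fixed total problem size: the ideal time on N processors is t_serial / N."""
    return t_serial / (processors * t_parallel)

def weak_scaling_efficiency(t_single, t_parallel):
    """Fixed problem size per processor: the ideal time stays equal to t_single."""
    return t_single / t_parallel

# Hypothetical timings in seconds.
strong = strong_scaling_efficiency(t_serial=100.0, t_parallel=12.5, processors=8)
weak = weak_scaling_efficiency(t_single=100.0, t_parallel=125.0)
print(strong)  # 1.0, the run on 8 processors took exactly 1/8 of the serial time
print(weak)
```

An efficiency that falls as processors are added points to one of the limiting factors listed above, such as communication bandwidth or an algorithmic bottleneck.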
B. Distributed Processing
Distributed processing is a setup in which multiple individual central processing units (CPUs) work on the same programs, functions or systems to provide more capability for a computer or other device.
C. Parallel Computing Architecture
1. Parallel Computer Memory Architectures
Distributed Memory
General Characteristics:
- Like shared memory systems, distributed memory systems vary widely but share a common characteristic. Distributed memory systems require a communication network to connect inter-processor memory.
- Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.
- Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
- The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
Advantages:
- Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
- Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain global cache coherency.
- Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
- The programmer is responsible for many of the details associated with data communication between processors.
- It may be difficult to map existing data structures, based on global memory, to this memory organization.
- Non-uniform memory access times - data residing on a remote node takes longer to access than node local data.
2. Hybrid Distributed-Shared Memory
General Characteristics:
- The largest and fastest computers in the world today employ both shared and distributed memory architectures.
- The shared memory component can be a shared memory machine and/or graphics processing units (GPU).
- The distributed memory component is the networking of multiple shared memory/GPU machines, which know only about their own memory - not the memory on another machine. Therefore, network communications are required to move data from one machine to another.
- Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
Advantages and Disadvantages:
- Whatever is common to both shared and distributed memory architectures.
- Increased scalability is an important advantage
- Increased programmer complexity is an important disadvantage
D. Introduction to Thread Programming
In computer programming, a thread is the context a program keeps in order to serve one particular user or service request, which lets a single program handle multiple users simultaneously. Threads let the program keep track of which user is being served as it is re-entered in turn on behalf of different users. Multiple threads can run concurrently within one process and share resources such as memory, whereas separate processes do not share these resources.
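A minimal sketch of threads sharing memory inside one process is shown below, using Python's standard `threading` module. The function name `count_with_threads` and the counter layout are invented for this example. Because every thread updates the same counter, a lock is used so that only one thread modifies it at a time.

```python
import threading

def count_with_threads(num_threads=4, increments=10_000):
    """Start several threads that all increment one shared counter."""
    counter = {"value": 0}     # lives in the process memory shared by all threads
    lock = threading.Lock()    # protects the counter from simultaneous updates

    def worker():
        for _ in range(increments):
            with lock:         # only one thread at a time may update the counter
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()               # wait until every thread has finished
    return counter["value"]

total = count_with_threads()
print(total)  # 40000
```

Separate processes would each get their own copy of the counter, so the same pattern would need explicit communication between them.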
E. Introduction to Message Passing and OpenMP
Message passing is a form of communication used in parallel computing, object-oriented programming (OOP), and interprocess communication.
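The idea of message passing can be sketched with two queues from Python's standard library: one task sends values as messages, the other receives them and sends replies back. This is only a stand-in for illustration. Real parallel programs usually rely on a dedicated library such as MPI, and the names `worker` and `square_via_messages` are made up for this example.

```python
import queue
import threading

def worker(inbox, outbox):
    """Receive numbers as messages, reply with their squares, stop on None."""
    while True:
        message = inbox.get()          # blocks until a message arrives
        if message is None:            # None is used here as the stop signal
            break
        outbox.put(message * message)  # send the reply as a new message

def square_via_messages(values):
    """Send each value to the worker and collect the replies in order."""
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for v in values:
        inbox.put(v)
    inbox.put(None)                    # tell the worker there is no more work
    t.join()
    return [outbox.get() for _ in values]

replies = square_via_messages([1, 2, 3])
print(replies)  # [1, 4, 9]
```

The two tasks never touch each other's variables directly: all data moves through explicit send and receive operations, which is the defining property of the message passing model.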
F. Introduction to CUDA GPU Programming
A GPU (Graphics Processing Unit) is a specialized processor designed to manipulate memory rapidly in order to accelerate image processing. The GPU is usually located on the graphics card of a computer or laptop.
CUDA (Compute Unified Device Architecture) is a platform created by NVIDIA that lets NVIDIA GPUs be used not only for graphics processing but also for general-purpose computing. With CUDA, we can take advantage of the many processing cores on an NVIDIA GPU to carry out large numbers of calculations in parallel.