[Programming Heterogeneous Systems: GPUs and Accelerators, 2023] ETH Zürich (Chinese-English subtitles)

  • Welcoming participants to the first lecture of the course on heterogeneous systems.
  • Discussion of the course objectives, focusing on programming heterogeneous computing systems with GPUs and other accelerators.
  • Introduction to the background of heterogeneous systems and the importance of scaling performance and efficiency.
  • Explanation of different types of heterogeneous computing devices and their applications in various domains.
  • Overview of course requirements, expectations, and the importance of hands-on projects.

Okay, I think we are going to start in a few seconds. Oh, I managed to unmute. Can you hear? Yes, I can hear you now. Can you hear me? Yes, very clearly. Okay, that's very good. So then I think it's time to start now.

So, hello guys. Welcome. Thank you for attending this first lecture of our course on heterogeneous computing systems. The title, the entire title of this course is "Programming Heterogeneous Computing Systems with GPUs and Other Accelerators." And that's what we are going to learn in this course.

How to work with heterogeneous computing systems, with a special focus on GPUs and parallel programming. So basically, today what we are going to do is just introduce the main goals of the course.

We will give also a little bit of background on heterogeneous systems, as well as interesting devices that are used for different purposes. For example, ML and neural network accelerators that we will briefly mention over the course of this first lecture.

Starting from next Friday, we will have more in-depth lectures about the architecture of SIMD processors, GPUs, GPU programming, parallel patterns, etc. I will show you some materials later for your reference, so you can already check, more or less, the collection of lectures that we are going to have in this course.

You can already check them on the website of the past semester. Feel free to stop me anytime. I will eventually ask you as well if you have any questions. And also feel free to use the chat here in Zoom. If you are attending the lecture on YouTube, you can also use the YouTube chat. I will keep an eye on it in case there are any questions there.

So before we go into details about the course, I think it's good to take a look at the contents of the course catalog with respect to this course because there is where the objective of this course is clearly defined.

I would say this is the description of the objectives of this course. It motivates the difficulty of scaling the performance and efficiency of CPUs, which affects many important modern workloads such as machine learning and artificial intelligence, but also other domains such as bioinformatics, graph processing, medical imaging, personalized medicine, etc.

So there is a need for heterogeneous devices, and good examples of devices other than CPUs that we can find in computing systems these days are graphics processing units, FPGAs, and also specialized accelerators, for example, the tensor processing units that are used for neural network training and inference, and even newer trends such as near-data processing architectures.

Even though there have been great advances in the adoption of heterogeneous systems in recent years, there are still many challenges to tackle. And that's what we are going to try to do in this course, at least partly.

So, one possible way of starting the discussion about these different heterogeneous systems is to take a look at the Flynn taxonomy of computers. Back in 1966, Mike Flynn already proposed a classification of computers where he divides them into four different main groups depending on how the devices operate on the data.

The first example is Single Instruction Single Data (SISD), where a single instruction operates on a single data element. This is like a sequential processor, e.g., a very simple CPU with just one core.

SIMD is Single Instruction operating on Multiple Data. This is where we are going to start today and also in the next lecture discussing these devices or architectures. SIMD architectures, where with a single stream of instructions, we compute on multiple data elements.

We are going to pay special attention to this type because they are one of the basic architectures that compose graphics processing units. As I said earlier, graphics processing units are a key device and a key type of device in this course.

The third type of computer is MISD (Multiple Instructions Single Data Element). It's not that easy to find good examples in the real world, but the closest form could be systolic processors. We are going to mention them also in today's lecture and also in some other lectures later in this semester.

And finally, the last one is MIMD (Multiple Instructions Multiple Data). In this case, multiple instructions operate on multiple data elements. So we have multiple instruction streams. Examples of these can be multiprocessors, multicore processors, or other types of multithread processors.

As I said, we are going to start with an example of a heterogeneous processor device. The example today is going to be SIMD processors. In particular, we are going to start with the SIMD ISA extensions that we can find in multiple CPUs these days. Because even a single CPU can be seen as a mini machine, a mini computer.

They are also heterogeneous in themselves because they have different types of processing elements. A good example of this is the SIMD extensions. In the Single Instruction Multiple Data (SIMD) extensions, a single instruction operates on multiple pieces of data at once. This is useful in many different applications, for example, graphics processing, and many more, as we will see and discuss in this lecture and over the entire course.

The basic idea in these extensions is to perform short arithmetic operations on multiple elements at the same time. For example, we can add four 8-bit numbers with a single instruction. The basic idea would be to start using one conventional 32-bit register, for example, as we can have in any CPU and divide this register into four parts of eight bits each.

The basic idea to be able to compute four additions at the same time is to modify the ALU to eliminate carries between eight-bit values. For example, the carry out of this bit 7 won't be propagated to the carry in of this bit 8.
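This partitioned-add idea can even be emulated in software on an ordinary 32-bit register, a trick known as SWAR (SIMD within a register). A minimal sketch in C, assuming unsigned 8-bit lanes that wrap around on overflow (the function name is illustrative):

```c
#include <stdint.h>

/* Add four 8-bit lanes packed in one 32-bit word, with no carry
 * propagation between lanes (each byte wraps around independently).
 * The low 7 bits of each lane are added normally; the top bit of
 * each lane is recomputed with XOR so its carry cannot leak out. */
uint32_t add_4x8(uint32_t x, uint32_t y)
{
    uint32_t low7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu); /* no cross-lane carry possible */
    return low7 ^ ((x ^ y) & 0x80808080u);                 /* restore each lane's top bit */
}
```

A hardware SIMD extension builds exactly this lane isolation into the ALU, so a single packed-add instruction replaces these three logical operations.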

And in this way, we would be operating on four elements at the same time without any interference between the four additions. So that's the basic idea in the SIMD ISA extensions that we can find in CPUs. A good example, and actually a seminal example of these, are the Intel Pentium MMX operations that were already implemented in the 90s.

The idea is to operate one single instruction on multiple data elements with graphics in mind. When these MMX instructions were proposed and first implemented in the 90s, as I said, they had multimedia and graphics operations in mind.

These MMX extensions provide different ways of operating on 64 bits. If you have a 64-bit register, you can operate on it as a single 64-bit quadword, or you can divide it and operate on two 32-bit doublewords, four 16-bit words, or eight 8-bit bytes.

So it all depends on how you create connections between the carry-out and the carry-in of the next bit, and where that carry stops, in order to keep the different operations independent.

I want to give you an example of how to use MMX instructions. This way, you can see a first example of how to operate with a SIMD machine, a parallel architecture. In this example, what we are going to do is overlay the human in one image on top of the background in another image.

So here we have two images, X and Y. As you can see on the left side, we have the image of a woman on a blue background. What we want to do is replace the blue background with a blossom background, so that the new image looks like this one on the right side.

Performing this operation on a sequential CPU is pretty straightforward. We just need to go one by one over all the pixels of the input image X and check if the corresponding pixel is blue or not. If it's blue, then we replace the pixel in the new image with the corresponding pixel of image Y, which is the blossom background.

As you see, if it's not blue, that means that the pixel belongs to the woman, to the image of the woman. So what we do is just copy the pixel from the woman to the final image, this new image that we have here on the right.

Because all the computations that we are performing here are exactly the same for all pixels, that means that we can operate on multiple pixels in parallel, just with a single instruction or with a single stream of instructions.

What we can do with the MMX extensions is to check several pixels at the same time, whether they are blue or not. To do so, we compare to an array or a register that contains many bytes, each of them with the value or with the color blue.

We are going to compare one by one these pixels here to generate a bitmask that we are going to use later to filter out the elements from the images X and Y that we really need to copy, really need to transfer to the new image.

In this part here, after having generated this mask that is stored in MM1, we can filter out those pixels of the blossom image that fall where the woman is. We filter out the ones we don't need, and we only keep the ones we really need, those that will appear in the background.

We do the same for the woman's image. We filter out all the blue pixels and just keep the pixels that correspond to the woman. After that, we combine the resulting filtered pixels from both original images, and we obtain the pixels of the final image, this new image that contains the woman in front of the blossom background.
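The whole mask-and-combine sequence can be sketched in scalar C, one byte per pixel for simplicity (the BLUE value and the function name are illustrative, not the real image encoding):

```c
#include <stdint.h>
#include <stddef.h>

#define BLUE 0x1D  /* illustrative byte value encoding "blue" */

/* Overlay image x (woman on blue background) onto image y (blossom).
 * This mirrors the MMX sequence: PCMPEQB builds the mask, PAND/PANDN
 * filter the two images, POR combines them; here, one byte at a time. */
void overlay(const uint8_t *x, const uint8_t *y, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t mask = (x[i] == BLUE) ? 0xFF : 0x00; /* compare: PCMPEQB */
        out[i] = (y[i] & mask) | (x[i] & ~mask);     /* PAND + PANDN + POR */
    }
}
```

With MMX, the same compare/and/andnot/or sequence processes eight pixels per instruction instead of one.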

So this is more or less how a SIMD machine would work. You can operate on multiple elements at the same time using a single instruction. But this is just one example of the data parallelism that we can find in current computing systems.

As I said earlier, it's very common to find SIMD extensions in CPUs, in multicore CPUs, but even in the same system on a chip you can also find other devices like specialized accelerators, neural network accelerators, for example, or GPUs, or maybe even reconfigurable logic, for example, FPGAs.

And it's more and more frequent as well that we find ways of performing coherent communication across these different devices, with specialized coherent interfaces or coherent interconnects that allow these different types of devices to have coherent copies of data, that is, the most recent values of the data they need to use, regardless of which device modified them.

All of that will be covered in more or less detail in this course. But for sure, we will talk about all these different components because they are important in current heterogeneous computing systems.

So, goals of this course. First of all, we are going to introduce the need for heterogeneity in current computing systems in order to achieve high performance and energy efficiency. You're going to get familiar with some of the different heterogeneous devices that are available in computing systems.

You're going to learn parallel programming, and we will focus on GPU programming with CUDA or with OpenCL. But even though we will mainly target a specific architecture and a specific type of device, most of the lessons about parallel computing patterns in this course are applicable to all parallel machines, or at least to many parallel machines.

And we will, in fact, mention other examples when we are discussing and presenting the parallel patterns. You are going to learn as well how to do workload distribution and parallelization strategies, the parallel patterns that I was mentioning.

And you will work hands-on, because this is a projects and seminars course. So here the goal is that each of you develops your own hands-on project. And these might be many different things, like programming heterogeneous architectures, analyzing workloads, proposing scheduling or offloading mechanisms, etc.

These are just like different potential directions for the projects. I will mention that later as well, but we are going to discuss the projects in the next lecture.

We find heterogeneity in many different levels. We have already seen a picture with an entire system on a chip with three different types of devices: CPU, GPU, FPGA. But even if we go and look inside each of these devices independently, for example, a GPU, we will see that there is also heterogeneity inside.

For example, this is the layout, the big picture, of the Nvidia A100 GPU, with 108 cores (streaming multiprocessors) and 40 megabytes of L2 cache. If you look at all these cores here in the image, you see that they are more or less homogeneous, right? But if you go inside them, you'll see that they have different execution units for different data types, e.g., integer, floating point, double precision.

And we also find this area here, the tensor cores that are small cores specialized for machine learning and deep learning. They appeared first in the Volta architecture back in 2017, and since then they have evolved. This generation of A100 GPUs integrated new support for sparsity where they can filter zero entries and compress and operate on compressed matrices.

This way, they are more efficient. Because, as you can imagine, if you are performing multiplications with zero entries, the result is going to be zero. So we are just wasting energy. We don't need to perform that multiplication, right? And that's something that is more and more common in neural networks these days.

Sparsity is more and more common in neural networks these days, and that's why these specialized units are starting to support some form of sparsity, in order to make compression, and then computation, more efficient.
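To make the energy argument concrete, here is a minimal sketch in C of a compressed sparse vector and a dot product over it (this is only an illustration, not the A100's actual 2:4 structured-sparsity scheme): only nonzero values are stored, each with its index, so every multiply-accumulate that is performed is a useful one.

```c
#include <stddef.h>

/* A sparse vector in compressed form: nnz (value, index) pairs. */
typedef struct {
    const float *val;  /* nonzero values      */
    const int   *idx;  /* their positions     */
    size_t       nnz;  /* number of nonzeros  */
} sparse_vec;

/* Dot product with a dense vector: only the nonzero entries are
 * touched, so no multiplication by zero is ever performed. */
float sparse_dot(sparse_vec s, const float *dense)
{
    float acc = 0.0f;
    for (size_t i = 0; i < s.nnz; i++)
        acc += s.val[i] * dense[s.idx[i]];  /* useful MACs only */
    return acc;
}
```

The hardware version additionally compresses the index metadata and keeps the access pattern regular, but the principle is the same: skip the zeros.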

In summary, what I wanted to say is that we see heterogeneity even inside the core of one specific GPU device. This is a more recent GPU. This is the H100 GPU that was presented in 2022. They keep including innovations inside the tensor cores.

In this fourth generation, we also see more heterogeneity in the data types that are supported. These tensor cores support 32-bit floating-point operands, 16-bit floating-point operands in two different formats, and also two different formats of 8-bit floating-point values.

These tensor cores are specialized for machine learning training and inference, for neural networks or deep learning. But they are not the only execution units, or devices, that have been designed for neural networks or deep learning.

I think another really interesting device is this Cerebras Wafer Scale Engine, which is the largest ML accelerator chip because in reality, it's not one single chip. It's an entire wafer, as you can see in the picture.

You can compare it as well with the size of one of the largest Nvidia GPUs. In this case, Nvidia Titan V. You can see the difference in the number of transistors and also the entire area. This picture here corresponds to the Wafer Scale Engine from 2019.

Two years later, they announced a newer larger one with basically twice the number of cores, twice the number of transistors, and also much larger than the largest GPU at that time, which was Nvidia A100. This was around 2021.

As you see, this is a very interesting device, and a very exciting direction in the design of machine learning accelerators. If you're curious about the Cerebras Wafer Scale Engine, I can recommend this talk, a SAFARI Live Seminar that one of the founders of Cerebras delivered last year in our group.

Another very good example of the accelerator design for neural network training, inference, machine learning, and artificial intelligence is the Tensor Processing Units (TPUs) that were first proposed or were first developed by Google back in 2016-2017.

As you can see here on the right-hand side, this is basically a systolic array. Remember that we mentioned systolic arrays at the beginning of the lecture as the closest example to the MISD class. Here what we have is a flow of data coming from the left, e.g., images or input feature maps.

What the systolic array basically does is perform a matrix multiplication, multiplying and accumulating into these accumulators here at the output. Why matrix multiplication? Because matrix multiplication is a key primitive for neural networks and many machine learning and artificial intelligence algorithms.

We are going to talk about matrix multiplication in detail in this course, and about how to lower convolutions into matrix multiplications, which is a very common technique in training and inference of neural networks these days. This is why this operation is so important, and why there are execution units, and even entire accelerators, specialized for this specific operation.
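As a preview of that lowering technique, here is a minimal im2col sketch in C for a single-channel input with stride 1 and no padding (function and variable names are illustrative): each column of the built matrix is one flattened receptive field, so the convolution reduces to a matrix multiplication with the flattened kernel.

```c
#include <stddef.h>

/* im2col for a single-channel H x W input and a KH x KW kernel,
 * stride 1, no padding. The output matrix col has KH*KW rows and
 * OH*OW columns: column (oy*OW + ox) holds the flattened receptive
 * field at output position (oy, ox). */
void im2col(const float *in, int H, int W, int KH, int KW, float *col)
{
    int OH = H - KH + 1, OW = W - KW + 1;
    for (int oy = 0; oy < OH; oy++)
        for (int ox = 0; ox < OW; ox++)
            for (int ky = 0; ky < KH; ky++)
                for (int kx = 0; kx < KW; kx++)
                    col[(ky * KW + kx) * (OH * OW) + (oy * OW + ox)] =
                        in[(oy + ky) * W + (ox + kx)];
}

/* The convolution itself is now a (1 x K) * (K x N) matrix product:
 * one dot product per output position. */
void conv_as_matmul(const float *kernel, const float *col,
                    int K, int N, float *out)
{
    for (int j = 0; j < N; j++) {
        float acc = 0.0f;
        for (int k = 0; k < K; k++)
            acc += kernel[k] * col[k * N + j]; /* multiply-accumulate */
        out[j] = acc;
    }
}
```

With multiple input channels and filters, the same idea yields a full matrix-matrix multiplication, which is exactly the shape a systolic array or tensor core is built for.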

That was the first generation of the Google TPU. The second generation, one year later, multiplied the total… well, the number of chips by 4. This also increases the FLOPS, the floating point operations per second, very significantly. They also replaced the DDR3 memory with high bandwidth memory in order to have a higher bandwidth of access to data to be able to feed these fast systolic arrays with data coming from memory.

The TPUs keep improving over time. The third generation increases the number of matrix units, the systolic arrays, from 2 to 4 per chip. This one here on the left is version 2; this one on the right is version 3. Again, they are doubling the teraflops per chip as well.

In the fourth generation, they achieved even a higher jump from 90 teraflops to 250 teraflops per chip in 2021, supporting new ML applications such as computer vision, natural language processing, recommendation systems, reinforcement learning, etc.

This is again the picture of the modern systolic array that I mentioned earlier or that we have seen earlier. Here, as you can see, is a little bit more detail of this systolic array, this matrix multiply unit. Observe how we have input coming from the left, this systolic data setup that is feeding input feature maps that might be coming from somewhere else from external memory.

Also, the weights come from these DDR3 chips or HBM memory. The weights of the neural network come from here; the array performs the matrix multiplication and accumulates the results in these accumulator registers. So, just another example of a heterogeneous device.

As I said earlier, we may have heterogeneity even inside the same chip. I gave you the example of GPU cores, but here you can see another really nice example. This Xilinx Versal has three compute engines, three types of compute engines inside the same chip.

It has some scalar processors, like regular CPUs. It has adaptable hardware that is similar to an FPGA and is designed for latency-critical workloads. And the Versal also integrates vector processing units, SIMD units, that are specialized for signal processing, video and image processing, and other parallel workloads.

Neural networks and machine learning can also be good targets for those vector engines. Here in this figure, they are called intelligent engines, or AIE, artificial intelligence engines, which are basically vector processors.

As you see, a very interesting picture here with the three compute engines: scalar cores on one side, in fact, two different types of dual core ARM CPUs, adaptable engines in the middle, an FPGA where you can have your custom memory hierarchy with the different SRAM-based memories that FPGAs can have.

And on the right-hand side, the intelligent engines that are vector or SIMD units specialized for more parallel and regular workloads. Okay, do you guys have any questions so far?

Okay, I can continue, and we can clarify anything that might be needed later. Okay, we were talking about different types of devices. Many of these devices and accelerators are specialized for important workloads.

Important workloads these days are machine learning, artificial intelligence, neural networks. Another interesting example of such an accelerator, such a processor, is the Groq Tensor Streaming Processor, presented in 2020. As you see here, it is designed for machine learning workloads that exhibit abundant parallelism.

The good thing about machine learning, or at least many machine learning workloads, is that they have quite regular structures. So it's relatively easy to create... well, relatively easy; it may not be easy. But at least it's possible to create a deterministic processor with a producer-consumer stream programming model.

If you are performing matrix multiplications, you are performing dot product operations, and you know the exact sequence of operations that needs to happen for that matrix multiplication to be executed. This is different from other workloads where you may have irregular memory accesses or indirect accesses that require you to first go to memory, fetch some pointers, and then, depending on where these pointers go, go into memory again and bring another value, etc.

That happens when we have more indirect access patterns, more irregular behavior. If you think about the matrix multiplication, we are accessing rows and columns of matrices. The computation and the accesses to memory and the operations that we need to perform at any given point are quite predictable.
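This contrast between predictable and pointer-chasing access patterns can be sketched in C; in the second loop, the address of the next element is only known after the previous load completes:

```c
#include <stddef.h>

/* Regular, predictable access: the whole address stream is known in
 * advance, so hardware can prefetch and a deterministic schedule of
 * memory operations is possible. */
float sum_dense(const float *a, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Irregular, dependent access: each node's address comes from memory
 * itself, so every load depends on the previous one (pointer chasing). */
struct node { float val; struct node *next; };

float sum_list(const struct node *head)
{
    float acc = 0.0f;
    for (const struct node *p = head; p != NULL; p = p->next)
        acc += p->val;  /* p->next is only known after this node is fetched */
    return acc;
}
```

Both loops compute a sum, but only the first has the regularity that a deterministic, statically scheduled processor can exploit.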

That's why Groq designed a deterministic processor where they can orchestrate the flow of data over all the different units that the Groq chip integrates. And here you can already see a pictorial representation of these units, with matrix units similar to the systolic arrays that we have mentioned earlier, and switch units that are used for communication, shuffling data from the matrix unit to the memory unit.

There are also vector units, similar to the matrix unit but intended to operate on a single vector, so they are more similar to a SIMD unit, and so on. These are the different units that you can see inside the Groq Tensor Streaming Processor.

Actually, you can see here the die photo, with the two hemispheres that the Groq processor is divided into. On the left-hand side, we see memory. We see switch units to move data, to shuffle data around, and we have the matrix units, the multiplication units.

In the middle, there is a vector unit for SIMD operations. You can also see different interfaces and links for communication across chips. Also for communication with another external device, for example, a host CPU via the PCI Express.

Groq proposes an entire scale-out hierarchy, from the Groq chip to the card, which may contain more than one chip. Inside one node, there will be multiple cards, and these different nodes can all be put into the same rack in order to scale out, to be able to train larger neural networks. I think Groq is a very interesting device as well.

We may talk about it in more detail in a later lecture. For now, what I would recommend is this SAFARI Live Seminar that Oscar Menser from Groq delivered basically two months ago in our group. The talk is accessible from the SAFARI website, and you also have here the link to the YouTube livestream.

It's still available. Another prime example is this Tesla Dojo from 2022. Tesla obviously needs very large systems to train neural networks, and they are very innovative in the design of new devices that can help in this purpose.

I think these Dojo training matrices, these Dojo training systems, are pretty interesting; they are composed of different types of processors as well. Here, as you can see, there is a regular host CPU, and they also include this chip, as we will see, a sort of interface processor to communicate between the CPU and the tiles.

These tiles are similar to a systolic array, performing the matrix-matrix and matrix-vector multiplications that are necessary for neural network training and inference. Observe also that the different devices have different sorts of memories, heterogeneous memories as well.

CPU may have conventional DDRx DRAM while these chips have HBM memory that provides significantly higher bandwidth. Here you can see a picture of the Dojo Interface Processor with 32 gigabytes of high bandwidth memory, a total bandwidth of 800 gigabytes per second.

And this is the D1 chip, the tile itself that is used for training the neural networks. As you see, a really high number of teraflops for different supported data types: 16-bit floating point, 8-bit floating point, and 32-bit floating point, depending on what are the specific needs of the neural network that they are training.

Okay, more examples of heterogeneous architectures. And this is actually a trend that I mentioned in the very beginning: near data processing or processing in memory. That is a trend that tries to alleviate the data movement bottleneck.

What's the data movement bottleneck? It's the frequent need for communication between memory and storage on one side and processing elements and compute units on the other. Probably all of you are familiar with the von Neumann model, where we have the processor on one side and the memory on the other side.

In the middle, we have a communication unit. The problem that we have in most current systems these days is that this communication unit is very narrow. So because it's very narrow, it basically behaves like a funnel. And that creates a lot of contention when accessing data, when moving data from one side to the other.

One way to alleviate this data movement bottleneck is called processing in memory. As the name indicates, it basically consists of placing processors inside the memory. This can happen in many different ways. One nice example is this UPMEM processing-in-DRAM engine from 2019 that basically modifies DDR4 chips.

As you can see here, these chips are modified to include small processors called DPUs near each DRAM bank. This way, the host CPU can offload computation to the processors inside the chips, and these processors can access data locally, operate on the data in the memory arrays, and do it at much higher aggregate bandwidth.

Because you're going to get much more bandwidth if you have many processors operating inside these chips than if you just try to access the data through the data bus that, as I said, behaves like a funnel. UPMEM is a French startup, a small company, but it's not the only company that is exploring and designing processing-in-memory devices.

Samsung, as one major DRAM vendor, already announced two years ago their first processing-in-memory architecture targeted at AI and machine learning. It's called FIMDRAM, or HBM-PIM. It's based on HBM memory.

What they basically did was modify the design of the HBM2 memory layers in order to place small processors called PCUs. You can see these PCU blocks here in the die photo, one next to each pair of DRAM banks.

These PCU blocks are small SIMD units that have higher-bandwidth access to the nearby banks, and they can perform multiplications and additions pretty fast. These multiplications and additions can be used to execute TensorFlow operations, or the matrix-vector and matrix-matrix multiplications that are needed for machine learning and artificial intelligence.

Samsung has another proposal, a different type of processing-in-memory system, which is DIMM-based. What they do here is place a small FPGA on the DIMM, and this FPGA implements several processing elements that have high-bandwidth access to the memory chips on the DIMM.

They can exploit rank level parallelism accessing all chips at the same time and this way increasing the overall bandwidth of the system, alleviating the data movement bottleneck in that sense, and accelerating workloads that would be memory-bound in the baseline system. The example here is recommendation systems that they tested in their first prototype.

But Samsung is also not the only major vendor that is exploring and designing processing in memory devices. SK Hynix is another major DRAM vendor that announced last year their GDDR6 AIM accelerator. Similar to Samsung's HBM-PIM proposal, they also place processing units near the memory banks.

These processing units are also specialized for multiply-and-accumulate operations that are needed for machine learning and artificial intelligence. So these are just some examples of heterogeneous computing systems.

As I said, we are going to have, or potentially we will have, many different types of devices inside the same system on a chip or inside the same computing system. But it's important as well to find efficient ways of communicating between these different types of processors, and also to keep the data in their caches and scratchpads coherent.

That's the role of coherent interconnects. Examples of these are CAPI and OpenCAPI, or the most recent CXL (Compute Express Link).

In the traditional approach to a system with heterogeneous devices, we need to explicitly move data between the main memory of the host processor, the CPU, for example, to the accelerator, for example, an FPGA. That's the traditional approach and that's what these coherent interfaces try to optimize and try to make more seamlessly integrated.

In summary, to make them easier to use for programmers and users. That was the proposal of the first CAPI and CAPI 2, which were the first two interfaces that IBM proposed. Then OpenCAPI became a standard followed by several vendors to provide a coherent interconnect between different types of accelerators or different types of devices in a system.

It's not only OpenCAPI; other vendors, such as Intel, for example, started to integrate these coherent interconnects between their CPUs and their FPGAs to provide a unified view of memory and to allow tighter integration and finer-grain collaboration between different devices in the system.

The most recent trend and standard is CXL (Compute Express Link), which is the industry standard interconnect that is being followed by many vendors these days. It's currently in its 3.0 specification and there will be more to come.

It is really providing, or starting to provide, a very interesting range of possibilities for different devices in the system, from accelerators with memory to simple memory buffers that extend the effective memory capacity of a system.

There are indeed three protocols supported (CXL.io, CXL.cache, and CXL.mem), depending on the sort of device and the type of communication we want to have between the different components in the system. Components like a network interface card, a regular accelerator such as a GPU or an FPGA, or simply memory buffers and memory extenders that allow a host processor to have access to much larger memory.

This Compute Express Link, as I said, is a really promising standard, and hopefully it will allow much more seamless and close integration between different devices, much finer-grain collaboration, and, in summary, more performance, more energy efficiency, and better programmability as well.

Okay, so we are going to talk more about all these types of devices and interconnects over the course of the semester. For now, let's summarize a little bit what the goals of the course are, what the key takeaways of this course are.

The first goal is to improve your knowledge of computer architecture and heterogeneous systems, and your technical skills in programming heterogeneous architectures, e.g., GPUs or FPGAs.

We have also proposed projects about FPGA programming. You will develop your critical thinking and analysis skills, and interact with a group of researchers, the SAFARI Research Group. You will get familiar with the research directions in our group and, more broadly, in the landscape of heterogeneous computing systems.

Finally, you will improve your technical presentation skills, because you will present your project at the end of the semester.

The key goal of the course is to learn how to take advantage of existing heterogeneous devices: programming them, analyzing workloads, proposing offloading and scheduling techniques, etc. There are not many prerequisites for this course, but you are expected to have a good background in digital design and computer architecture.

If you want a refresher on these topics, I recommend you take a look at the contents of Professor Mutlu's Digital Design and Computer Architecture course on our website, which is taught every spring semester. We are currently starting the Spring 2023 semester, but you can find all the materials on the website of the past spring semester.

You are expected to be familiar with C/C++ programming. This is important for many different potential projects, like programming architectures, doing simulation work, etc. So we expect that you can program. If you also have some basic knowledge of FPGA programming or GPU programming, that will be even better, for sure.

What we expect from you is an interest in computer architecture and computing paradigms, in discovering why things work or do not work, and in solving problems to make systems efficient and usable.

This course is delivered by me and the SAFARI Research Group. The SAFARI Research Group is led by Professor Onur Mutlu, who has been a professor at ETH Zurich since 2015. He has plenty of experience in research and teaching on computer architecture, computing systems, hardware security, bioinformatics, etc.

The rest of the team comprises me. I'm the lead supervisor and I will teach most of the lectures of this course (not all, but most). There are also other supervisors who may supervise your own personal project: Dr. Mohammed Alser, Dr. Behzad Salami, and Dr. Mohammad Sadrosadati, as well as Joël Lindegger, who is a PhD student in our group.

You will have a chance to meet us and get to know us as well. You can also take a look at the entire SAFARI Research Group website. It's a pretty large group, around 40 researchers these days. We do a lot of work in research, teaching, and also dissemination activities of different kinds.

For example, the newsletters that we publish from time to time (this is an old edition, but a new edition is coming and hopefully will be released soon). We also organize many other activities. I mentioned earlier the SAFARI Live Seminars, a seminar series that we started in 2021 and that I would recommend; you have a link here to the entire series.

You can find them on our website as well, with links to the YouTube recordings and, in some cases, the slides if you want to take a look. In our group, we do research on many different aspects of computing systems, but we have a clear focus on computer architecture, hardware/software co-design, and bioinformatics applications.

In computer architecture, we cover basically all the different components of computing systems, from persistent memory and storage to hybrid main memory, heterogeneous processors, accelerators, etc.

Ok, I have this slide here for you guys to introduce yourselves, but I think that we can skip this for now. We will have a meeting next week where we will discuss the different projects and I think that can also be a good opportunity for you to introduce yourselves and also let us know what your main interests are. Because probably based on that we will find what's the perfect match in terms of a hands-on project for you to develop this semester.

Do you guys have any questions so far? Okay, so if there are no questions, let me continue with course requirements and expectations. Attendance is required for all meetings, but there is a lot of flexibility here.

In fact, this is probably one of the few live meetings that we are going to have. As you'll see, the lectures are going to be in this slot, Friday at 9 am (or most of the time they will be), but it's likely that we'll have pre-recorded lectures that you can watch on Fridays at 9 am or sometime later.

The lectures are going to be available for your own use indefinitely, so just feel free to organize your schedule and time the way you prefer. But you are expected to attend the lectures, study the learning materials, and, for sure, carry out your own hands-on project, which, as I said, we will discuss next week.

We also expect from you that you participate, ask questions, contribute thoughts and ideas, and also read relevant papers. It's important to read papers in order to understand how research works and also to get inspired on exploring new ideas. We will help you a lot in your projects.

You will have very close supervision from us, from me or from any of the other supervisors in this course. We will push for good work that may eventually get published, if you really do a good job and want to publish it as well, of course.

So, you will find the important materials on the course website. There, you can find information about the course, all course materials, and links to the lectures and slides. Check also your email for announcements that we might make from time to time, either via Moodle or via direct email, as I have done so far.

Remember that there is also a Q&A forum in Moodle that you can use for any questions you may have regarding the course, basically the course lectures. If you have questions about your own project, you can discuss them directly with your supervisor. You will meet your supervisor or supervisors regularly.

This is the link to the current website of the course, the Spring 2023 edition, and also a link to the YouTube livestream where we will have the playlist with all these lectures. I already provided you with some learning materials over email. You can find them also here in this slide.

They are very much recommended, of course. There are, for example, some introductory lectures on SIMD processors, GPUs, and heterogeneous programming. We are going to cover these contents in later lectures of this course as well.

So I think it would be good for you to watch these lectures or take a look at the slides before the lectures of this semester continue, so that you already have some background, which will facilitate your later learning. But as I said, we will cover all these contents in quite a bit of detail this semester.

In next week's meeting, we are going to announce the projects. As I already said, we will give you a description of these projects and you will have a chance to select one. We will give you a couple of days to think about it and then express your preferences.

Based on that, we will decide what's the right project for you. After that, you will start having one-to-one meetings with your supervisors in order to start developing the project.

As for the following meetings, I mentioned this already: we are going to have lectures about GPU programming and parallel patterns, lectures about research works, and then individual meetings with your mentors. Finally, at the end of the semester, after you have developed your project, you will present your work.

For an overview of what this semester will more or less look like, please take a look at the website from Fall 2022. There you have, as you can see on the right-hand side of this slide, links to all recordings and all slides, and at the very bottom of the website, you'll find more recommended materials, papers, etc.

Okay, so that's all about the introduction of the course. We are about to wrap up and finish today's lecture. If you have any questions right now, please, guys, let me know. If not, I'm going to start introducing what will be the contents of the next lecture.

In lecture number two, we will talk about exploiting data parallelism. We have already started talking about this: we started today's lecture with SIMD processors, and that's going to be the main topic of the next lecture.

SIMD processors and GPUs. Remember that SIMD processors are one of the four different types of machines that Mike Flynn identified in his 1966 paper. SIMD is a machine where a single instruction operates on multiple data elements at the same time.

There are two classic examples of SIMD machines: one is the array processor, and the other one is the vector processor. We are going to mention very briefly what the key differences between these two are, and we will see that real SIMD machines, like GPUs for example, have characteristics of both types of SIMD processors: array and vector processors.

I already gave you the example of SIMD extensions in CPUs with the MMX example for image overlaying. The idea is that we are going to perform the same computation on every single pixel of an image, and because of that, we can take advantage of that and save instructions basically by having one single instruction operating on multiple data elements.

And I think this example of the image overlay must already be clear. In SIMD processing, we have one single instruction operating on multiple data elements, and we say that this happens either in time or in space.

Depending on the type of SIMD processor that we consider, we have multiple processing elements or execution units, and this operation in time or space is called the time-space duality. For example, in an array processor, one instruction operates on different multiple data elements at the same time using different spaces or different processing elements, while in a vector processor, one instruction operates on multiple data elements in consecutive time steps using the same space or the same processing element.

I think this is clearly represented in this slide. Observe that on the left-hand side you have the instruction stream: four different instructions, load, add, multiply, and store. Each of these instructions is a vector instruction.

It's one single instruction that operates on multiple data elements, but you can do that in different ways. You can have an array processor, where you have multiple processing elements, and each of these processing elements is capable of performing different operations like load, add, multiply, and store.

In this array processor, we have the same operation happening at the same time in the different processing elements, as you see. But we can also have the vector processor, where the execution units are specialized for load, add, multiply, or store.

Here, with a single instruction, you can operate on multiple data elements at different times using the same space. As you see, for example, this addition here operates on four elements, but it does so in four consecutive cycles, and all of that happens in the same unit, the addition unit.

Observe how that is different from the array processor. But if you look at the internals of an Nvidia GPU core, for example, you'll see that it has characteristics of both: part array processor and part vector processor.

Why is that? Because we have many different units, as you see, specialized for 32-bit floating point, load/store, 64-bit floating point, or even these tensor cores. But we also have several of these units, so we can operate both in time and in space using a single instruction stream.

It's not easy to program all these systems, by the way. Hopefully, in this course, we will help you learn how to do that. But I find this reflection from Fisher in 1983 really interesting: he said that to program a vector machine, or SIMD machines in general, the compiler or the hand coder must make the data structure in the code fit nearly exactly the regular structure built into the hardware.

That's hard to do in the first place, and just as hard to change. What this means is that parallel machines such as these SIMD processors are quite static: you have a certain number of execution engines, and these execution engines have certain capabilities, but they might not be very flexible.

You need to adapt your workload to the execution units that you have. But workloads might not be so regular in the real world, and even if they are regular, their size may not be an exact multiple of the number of units that you have.

So programming these SIMD processors in an optimal way is sometimes quite challenging. There has been a lot of progress in recent years on how to program these systems, and a good example of that is the CUDA programming model, which we are going to cover extensively in this course.

But still, it is not that easy to do, and that's why it's important to attend this course. As we will see and continue discussing in our next lecture, in a heterogeneous system with a GPU you will typically choose the best device for the different parts of your computation or your workload.

Typically, sequential or modestly parallel sections of code will run on the CPU, while massively parallel sections of code will run on the GPU. We're going to have a single thread or very few threads of execution running the serial code on the host processor, the CPU, while a parallel kernel runs on the device, the GPU in this case, utilizing multiple threads and multiple SIMD units and lanes at the same time.

When the parallel kernel finishes, control returns to the CPU, which may execute more serial code and at some point launch a new parallel kernel onto the device, the GPU. In this example, you will see that writing GPU code, CUDA code, or OpenCL code is not that difficult.

But we definitely need to learn how to do it and change our mindset. Here you can see a very simple example of how to port a sequential CPU code, which operates on two input arrays, A and B, and writes the result into a third array, C, to CUDA code.

What we do is kind of unrolling this entire loop and assigning one iteration of the loop to one individual thread. We’ll have thousands or millions of threads running on the GPU at the same time, and each of these threads is going to do relatively fine-grained computation because they are just accessing one element from A, one element from B, adding them, and storing the result in the output array C.

So as you see, we need to change our mindset when it comes to programming these systems. We need to have this parallel computing mindset in order to be able to map computation onto all these millions, or at least thousands, of available threads that we have in GPUs and other parallel devices.

Hopefully, even though Fisher said that this is hard to do and hard to change, we will learn how to do it, and we will learn how to optimize code for GPUs and other heterogeneous devices as the ones represented in this slide.

If you want to check some good examples of parallel programs and heterogeneous execution on CPU, GPU, and FPGA, you can take a look at the Chai benchmark suite, for example, which we developed a few years ago with different collaboration patterns for these different types of devices.

There are different benchmarks with different types of collaboration patterns, and also different versions of the Chai benchmarks in CUDA and OpenCL, for CPU, for GPU, for FPGA, and even for simulators.

This is all for today. I am done with my part of the lecture. So guys, if you have any questions, please let me know right now. If not, I will send you an email sometime later today or this weekend in order to find a good time for everyone for meeting two so that we can discuss the projects, and know what your main interests are. Based on that, sometime next week, we can decide what's the right hands-on project for you.

Will we get a list of the available projects in general beforehand or will this be a surprise?
We will present that in the meeting, and after that, you will get the list in Moodle.

My main question is how much time are you going to have in order to work on this project?
So for the course itself, there is one lecture per week, which shouldn't be longer than today's, basically one hour at most. The amount of time you are expected to spend on the project is a few hours every week.

I think that this is something that you will discuss directly with your supervisors. A lot in this course depends on your excitement, how much you want to learn, and how far you want to go. For example, if you work with me on an implementation of an … algorithm on a GPU, you can just reach the goal by implementing a basic version, a well-optimized or at least reasonably well-optimized implementation.

Or you can try fancier optimizations and do more design-space exploration. Obviously, that will require more time, but it will also allow you to learn more. So a lot in this course depends on how excited you are, how much you really want to learn, and how far you really want to go.

Thank you.

And one question. Do we need to submit by a deadline?
There is no deadline. The deadline is just the end of the semester, and we expect that you complete the project by the end of the semester.

And after that, someday, probably in late June or July, we will have a final meeting where you guys will present your different projects. But that also shouldn't be a very long presentation, maybe just a 10-15 minute presentation or so.

Okay, thank you for the questions. Anything else? Sounds good. So, if there is nothing else, I think that we are done for today. I really appreciate that you attended, and I hope to see you next week. Have good discussions.