From: "Andy Glew"
Newsgroups: comp.arch
Subject: Re: Does PowerPC 970 has Tagged TLBs (Address Space Identifiers)
Date: Mon, 12 May 2003 16:37:52 GMT
Organization: Prodigy Internet http://www.prodigy.com
References: <3486820e.0305112321.1b4c3b26@posting.google.com>

> As you know a tagged TLB is useful for improving performance in
> context switches so that it's not required to refill the entire TLB.
> When a context switch happens, the new context does not have to refill
> the TLB entries that their ASID is the same as its ASID.
>
> The only processor that has a tagged TLB architecture and I'm aware of
> it is MIPS. Tagged TLBs are key to the success of microkernel based
> OSes.

Tagged TLBs are one way of allowing TLB entries to persist across
context switches. However, TLBs tagged with process IDs do not
help shared libraries or data that are shared between some, but not
all, processes.

I'm tempted to say that process-ID TLB tagging is now known to be
a dead end wrt instruction set architecture, because of these
other alternatives:

   (0) Process ID tagged TLBs
   (1) Folded Address Space
   (2) Object ID tagged TLBs
   (3) Snoopy TLBs

(0) - the original poster probably understands them. TLB entries
are tagged with a process ID, and are only hit if the current
process ID matches that in the TLB. Usually, there is a kluge to
allow the process ID tag for kernel TLB entries to be ignored, so
that kernel TLB entries can be shared amongst all processes.
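
As a rough illustration of (0), here is a minimal Python sketch of a process-ID tagged TLB lookup, including the kernel "ignore the tag" kluge. The names (`TlbEntry`, `TaggedTlb`, `asid`, `is_global`) are made up for this example and do not correspond to any particular processor; capacity and replacement are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TlbEntry:
    vpn: int                 # virtual page number
    pfn: int                 # physical frame number
    asid: int                # process ID tag (hypothetical field name)
    is_global: bool = False  # the "kluge": kernel entries ignore the tag


class TaggedTlb:
    """Toy fully associative TLB; capacity and replacement are ignored."""

    def __init__(self):
        self.entries = []

    def insert(self, entry: TlbEntry) -> None:
        self.entries.append(entry)

    def lookup(self, vpn: int, current_asid: int) -> Optional[int]:
        """Return the frame number on a hit, or None on a TLB miss."""
        for e in self.entries:
            if e.vpn == vpn and (e.is_global or e.asid == current_asid):
                return e.pfn
        return None


tlb = TaggedTlb()
tlb.insert(TlbEntry(vpn=0x10, pfn=0x80, asid=1))                  # private to process 1
tlb.insert(TlbEntry(vpn=0xF0, pfn=0x01, asid=0, is_global=True))  # kernel page
tlb.insert(TlbEntry(vpn=0x20, pfn=0x90, asid=1))                  # shared library, loaded by process 1

# A context switch only changes current_asid; nothing is flushed.
private_hit = tlb.lookup(0x10, current_asid=1)    # hit: tag matches
private_miss = tlb.lookup(0x10, current_asid=2)   # miss: other process
kernel_hit = tlb.lookup(0xF0, current_asid=2)     # hit: global entry
shared_miss = tlb.lookup(0x20, current_asid=2)    # miss, although the page is shared
```

The last lookup shows the limitation mentioned above: process 2 maps the same shared library page, but the entry is tagged for process 1, so process 2 needs its own duplicate entry.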

(1) As others have pointed out, the sort of "folded address space"
(my name) that is in some of the Power chips, HP PA-RISC, and the
Intel Itanium, gives you something better than tagged TLBs.

I call these "folded" because, given a V1-bit virtual address, the
upper S1 bits are looked up in a table that provides you with
V2 = V1-S1+S2 bits of virtual address - the smaller V1-bit virtual
address is "unfolded" into a larger V2-bit virtual address. I.e. the
V1-bit virtual address space is divided up into 2^S1 "segments" of
(2^(V1-S1)) bytes each. Typical sizes are V1=64 bits, V2=80 bits,
and S1=64-52 bits.

The unfolded address has V2-V1 = 16 more bits than the original, so,
at the very least, you can just use this as a 16 bit process ID TLB
tag. But you should be able to see how this can be used for partial
sharing.

(2) Let me just briefly mention object ID tagged TLBs, where entries
are tagged not with the process ID, but with an "objectID" that
corresponds to (a) a shared library id, or (b) a program text id, or
(c) finally, the process id for process private, unshared, data.
I.e. instead of all TLB entries having the same TLB tag, a process
uses TLB entries with several different TLB tags.

One implementation reads TLB entries, and compares the tag to a list
of "currently active" tags on the processor. A miss might constitute
a TLB miss, or possibly the list of currently active tags can itself
be considered to miss. Another implementation uses
"activate/deactivate" CAMs or scans.

It can be seen that the special handling of kernel TLB entries in
process ID tagged TLBs is just a runt form of this.

(3) Finally, there arises the possibility of snooped or coherent
TLBs. TLB entries can be snooped to remain consistent with the memory
copy of the page tables, and consistent with the current process.

AMD's "TLB Probe Filter" can be considered a step in this direction,
towards snoopy TLBs.

Interestingly, while I was at Intel and at university (both prior to
Intel, and at Wisconsin after/between Intel stints) I designed
similar structures preceding AMD's announcement, and then I tried to
figure out what AMD had built based on the few scanty slides; now
that I have seen what AMD actually built, I can also see how to do
better. If TLB miss costs matter, there could well be an arms race
in this area between AMD and Intel.

The nice thing about snoopy TLBs is that they do not require any
architectural changes, or OS changes. The bad thing is that, done
naively, they are quite expensive in hardware; potentially lots of
snoopers. Of course, you don't need to be naive; snoopers can be
shared between many different TLB entries, if there is any degree of
locality or non-sparseness in the virtual address space.

I feel reasonably confident in saying that snoopy TLBs are buildable
for conventional virtual memory architectures.

I feel less confident in saying that snoopy TLBs are buildable for
IBM style multilayer virtual machine architectures; or, rather, they
are buildable, but virtualized page tables provide an extra
combinatorial factor; and, since the most common ways of dealing with
page tables in VMs involve the virtual machine host unmapping the
virtual machine guest's page tables, it is not clear that snoopy page
tables need to be extended to multilayer VMs. But they could be.

It is important to note that there are two issues here:

   (1) snooping page table memory writes, so that the TLBs can be
       consistent, whether instantaneously or delayed until the next
       TLB "invalidate"
   (2) tracking which TLB entries belong to which process

They are related. While it would certainly be possible to have TLBs
"instantaneously" coherent with memory, it is not clear

   (a) if that might not break some OSes
   (b) if that might not be unnecessarily expensive
   (b') if that might not prohibit some interesting implementations.
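
To make the idea of sharing snoopers between TLB entries a little more concrete, here is a toy Python sketch. It is not a description of what AMD or anyone else built: the class name `SnoopyTlb`, the 4096-byte `SNOOP_GRANULE`, and the policy of invalidating every entry loaded from a written granule are all assumptions made for illustration. It also picks the "instantaneous" variant of issue (1), which, as noted above, may or may not be the right choice.

```python
SNOOP_GRANULE = 4096  # bytes of page-table memory per snooper (assumed)


class SnoopyTlb:
    """Toy TLB whose entries are invalidated by snooped page-table writes."""

    def __init__(self):
        self.entries = {}   # vpn -> (pfn, address of the PTE it came from)
        self.snoopers = {}  # granule number -> set of vpns loaded from it

    def fill(self, vpn, pfn, pte_addr):
        """Load a translation and register it with its granule's snooper."""
        self.invalidate(vpn)  # drop any older copy first
        self.entries[vpn] = (pfn, pte_addr)
        self.snoopers.setdefault(pte_addr // SNOOP_GRANULE, set()).add(vpn)

    def invalidate(self, vpn):
        old = self.entries.pop(vpn, None)
        if old is not None:
            granule = old[1] // SNOOP_GRANULE
            self.snoopers[granule].discard(vpn)
            if not self.snoopers[granule]:
                del self.snoopers[granule]  # snooper is free for reuse

    def lookup(self, vpn):
        hit = self.entries.get(vpn)
        return hit[0] if hit is not None else None

    def snoop_write(self, addr):
        """Called for every observed memory write; returns entries dropped."""
        victims = self.snoopers.pop(addr // SNOOP_GRANULE, set())
        for vpn in victims:
            self.entries.pop(vpn, None)
        return len(victims)


snoopy = SnoopyTlb()
snoopy.fill(vpn=1, pfn=0x80, pte_addr=0x1000)
snoopy.fill(vpn=2, pfn=0x81, pte_addr=0x1008)  # same granule as vpn 1
snoopy.fill(vpn=3, pfn=0x82, pte_addr=0x5000)  # different granule

dropped = snoopy.snoop_write(0x1008)    # OS rewrites the PTE for vpn 2
unrelated = snoopy.snoop_write(0x9000)  # ordinary data write, no snooper
```

Three entries need only two snoopers here because two of the PTEs sit in the same granule. The price is over-invalidation: the write to the PTE for page 2 also drops the entry for page 1.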

===

It's not clear if any of this is worthwhile, if TLB misses are
cheap - e.g. if they can be done speculatively, if they can use the
cache, etc.

MIPS probably needed some help because of software TLB miss handling.
Although even this can be accelerated, e.g. on a multithreaded
machine.

===

Anyway, bottom line:

Tagged TLBs are probably reasonable, since just about all
implementations described above have one form or another of TLB tags.

MIPS-style process ID tagged TLBs are probably a dead-end.

Folded virtual addresses or object ID tagged TLBs are probably
better.

Snoopy TLBs are a bit more expensive, but not as expensive as the
naive think, and are architecturally invisible.


From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Does PowerPC 970 has Tagged TLBs (Address Space Identifiers)
Date: Tue, 13 May 2003 08:07:55 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2003May13.100755@a0.complang.tuwien.ac.at>
References: <3486820e.0305112321.1b4c3b26@posting.google.com>

"Andy Glew" writes:
>Tagged TLBs are one way of allowing TLB entries to persist across
>context switches. However, TLBs tagged with process IDs do not
>help shared libraries or data that are shared between some, but not
>all, processes.

They help them just in the same way as they help non-shared VMAs,
i.e., by letting their TLB entries persist across context switches.
They may result in multiple TLB entries for the same object. But is
this a significant performance problem?

Trying to think of a typical scenario where persistence across
context switches helps significantly: There would be a lot of
processes that do little computation before activating the scheduler
again (due to a blocking system call, e.g., I/O or IPC), and there
would be one or several CPU-bound processes that would suffer from
TLB misses after each activation of a non-CPU-bound process if there
were no ASIDs.

In such a scenario: Is there significant sharing of active pages
between the CPU-bound processes and the non-CPU-bound processes?
Probably not.

Is there sharing between the non-CPU-bound processes? There probably
would be, but if the CPU-bound processes need lots of TLB entries,
they will throw out the other TLB entries anyway, so this sharing
cannot be utilized. If the CPU-bound process does not need lots of
TLB entries, would utilizing the sharing with more sophisticated TLB
tagging help much?

Is there sharing between the CPU-bound processes? Maybe, but context
switches between them are rare enough that this is not a significant
issue.

>I'm tempted to say that process-ID TLB tagging is now known to be
>a dead end wrt instruction set architecture, because of these
>other alternatives:
>   (0) Process ID tagged TLBs

Benefit: persistence across context switches

Cost: some changes to the OS

>   (1) Folded Address Space

Benefit: sharing of TLB entries between objects

Cost: (In addition to OS changes) Various restrictions at the user
level if you want to make use of the benefit. E.g., AIX originally
allowed only 10 mmaps per process.

I don't consider the benefits worth such costs.

>   (2) Object ID tagged TLBs

386 style segmentation? Or something like 2, but with more
flexibility? The former requires lots of user-level changes.

>   (3) Snoopy TLBs

Benefit: completely transparent to software.

Cost: Additional hardware complexity.

If the hardware cost can be made small enough, this looks like a
winner. Otherwise I think that (0) still has the best benefit/cost
ratio.

- anton
-- 
M. Anton Ertl  Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html