From MAILER-DAEMON Thu Jul 29 15:14:59 1999 Date: Thu, 29 Jul 1999 15:14:59 -0700 (PDT) From: Mail System Internal Data Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA X-IMAP: 0933286497 0000000044 Status: RO This text is part of the internal format of your mail folder, and is not a real message. It is created automatically by the mail system software. If deleted, important folder data will be lost, and it will be re-created with the data reset to initial values. From eylon@cs.berkeley.edu Wed Oct 7 20:08:50 1998 Date: Wed, 7 Oct 1998 20:08:49 -0700 (PDT) From: Eylon Caspi To: Randy Huang , Michael Chu Subject: CS 262 project form Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 1 Hi all, Here is my first shot at the CS 262 project proposal. Your comments appreciated. It is due tomorrow. BTW, how large is a compute-page? How many cells? How much memory? Mind you, the listing of project partners is sorted alphabetically! I am not a megalomanic, I swear! Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689 -------------------------------- Name: Eylon Caspi -------------------------------- E-Mail Addresses: eylon@cs.berkeley.edu mmchu@cs.berkeley.edu rhuang@cs.berkeley.edu -------------------------------- Concise Project Title: Scheduling for Virtualized FPGA Hardware -------------------------------- Names of Project Partners: Eylon Caspi Michael Chu Randy Huang -------------------------------- Provide an abstract of your project: (2-3 paragraphs) The BRASS (Berkeley Reconfigurable Architectures, Systems, and Software) HSRA (Hierarchical Synchronous Reconfigurable Array) is a high-speed, synchronous FPGA (field-programmable gate array) with hierarchical (tree-based) interconnect and embedded DRAM, to serve as a reconfigurable co-processor for a conventional RISC core. To be useful in a general-purpose, scalable computing environment, the array should support efficient scheduling of designs too large to fit on the array, as well as multi-processing of simultaneous, independent designs. The HSRA's solution to these problems involves (i) virtualizing hardware at a granularity of "sub-arrays" or "pages" containing some compute-cells and memory (presently 64 cells and 8 MB ?), and (ii) using a stream-based, data-flow programming model which allows inter-page I/O to be buffered transparently through memory when communicating pages are not simultaneously loaded in hardware. In this CS 262 project, we wish to consider the problem of automatic scheduling of computations (which have been pre-partitioned into pages) on the HSRA. The project will involve formalizing some details of the HSRA's runtime model, developing one or more scheduling policies and corresponding performance metrics, and evaluating the performance of a workload of hand-partitioned multimedia programs. For performance evaluation, we intend to develop a behavioral simulator which schedules compute-pages as black-boxes (implemented by Java behavioral code) with known communication requirements. -------------------------------- What is the key contribution you hope to achieve with your project? What do you hope to prove (or disprove) with your project? What is the research hypothesis to be tested by the project? Who would be interested in your research results, and why? - virtualized FPGA pages are a "good thing" - develop placement/scheduling policy(s) for a time-shared FPGA - develop behavioral simulator for HSRA (for future BRASS work) - evaluate best compute-page size? -------------------------------- What is the methodology for your project? If you are building something, give the platform(s), language, and test cases. If is trace-based, describe how you will collect and analyze the traces. If it is an evaluation, list the input cases and describe why you believe (or will conclude) that these are representative. - Develop placement/scheduling policy(s) based on particular metrics (e.g. job or data throughput) - Develop behavioral simulator to schedule pre-partitioned designs. Compute pages will have (i) behavioral Java code and (ii) specified communication requirements (nodes, rates, real-time, etc.) - Write behavioral code for several multimedia applications * partition job into "pages" (estimate page size) * maybe multiple implementations for different design criteria (throughput, memory, etc.) or different page sizes? -------------------------------- How far do you expect to be by the next survey (November 5th)? The next major checkpoint is one month away. Think now about where you hope to be. - Study + refine (preliminary) HSRA runtime model - Read about resource-allocation + scheduling for: * producer-consumer problems (esp. on multiprocessors) * data-flow architectures - Propose performance metrics + scheduling policies tailored to them - Choose benchmarks & think about implementation / page-partitioning - Build behavioral simulator infrastructure (start on it?) - To do later: * implement benchmarks (most tedious part?) * run simulations * write report -------------------------------- What is your project's web page? We would like you to begin assembling a web page for your research project, incorporating your abstracts, plans, and incremental results. The instructor will begin visiting these sites from time to time to check on your progress. These will be linked to from the CS 262 Home Page. Remember, your colleagues in the course will be looking at these pages! http://www.cs.berkeley.edu/~eylon/cs262/ -------------------------------- What additional help do you need for your project? Resources, accounts, etc. - Pizza - Palm III's - Peace in the Middle East - A working version of JDK -------------------------------- From mmchu@screech.CS.Berkeley.EDU Wed Nov 4 14:39 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id OAA29836; Wed, 4 Nov 1998 14:38:58 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id OAA17855; Wed, 4 Nov 1998 14:38:57 -0800 (PST) Date: Wed, 4 Nov 1998 14:38:57 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Nov. 4, 1998 Meeting Notelets. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1018 Status: RO X-Status: X-Keywords: X-UID: 2 - We said we would each think about the interface a bit more and email/meet in a few days (interface being the flow of events in coding for the runtime system and scheduler). - 3 apps: - wavelet (rhuang) - triangle rasterizer (mmchu) - ap?? (still can't get the name right... :-)) (eylon) - multiple kernels can be scheduled. - kernel is given arbitrary inputs and eventually, the outputs are collected (no tightly-coupled computations with the processor). - current swap time: 3300 cycles, ~13 us. (state out, config in, state in). - initially, we said statically known input rate ratios. but not sure if this is too restrictive. - 64 BLB page size (must be fixed for hand partitioning). If I missed anything (I'm sure I did), please let me know. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@ribbit.CS.Berkeley.EDU Mon Nov 9 04:21 PST 1998 Received: from localhost by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.0) with SMTP id EAA20631; Mon, 9 Nov 1998 04:21:22 -0800 (PST) Date: Mon, 9 Nov 1998 04:21:22 -0800 (PST) From: Michael Chu To: Eylon Caspi , Randy Huang cc: "Andre' DeHon" , Michael Chu Subject: [CS262] Premeeting Notes(mmchu) Nov. 9, 1998 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 5841 Status: RO X-Status: X-Keywords: X-UID: 3 These are my notes/ideas on how the model/implementation/knobs-to-tweek should be. We can discuss this model after the BRASS meeting. I want to start coding this up this week, so we really need to decide on an interface (even if it is not perfect) so we can get on with some of the real work. Premeeting Ideas for Project November 9, 1998 Michael Chu ------------------------------------------------------------------------------- Model assumptions/restrictions: - A compute page is 64 BLBs large. - The designs will be hand-partitioned (therefore, necessarily rough) into 64-BLB compute pages connected via stream FIFO buffers. - Designs (or sum of designs) may be larger than the available number of BLBs in the physical array. - Multiple designs will be temporally/spatially partitioned on the array at once. - Compute pages may have multiple inputs and outputs. - Stream buffers have read request, write request, read data ready, write data done signals. In addition, to signal to the scheduler system, stream buffers have buffer full, buffer high watermark, buffer low watermark signals. - All inputs and outputs are FIFOs and data must be accessed and processed sequentially. - If array has N physical compute pages, any of the N compute pages is guaranteed to be routable to any of the other N-1 compute pages. - All compute pages are connected via stream buffers of finite length (since pages may have dynamic input/output rates). - The only interaction between a design and the application which spawned it are: design start (with specified finite input stream), design finish (with resulting finite output stream). - Output pages have arbitrarily large buffers (so there should be no issue of design output being full). - CMBs are available which assume cycle read/writes and no stalling. Simple SRAM-like signaling: request, read/write bar, write port, read port, address port. Assume arbitrarily large CMB with 64-bit wide input/output and separate read/write ports. Development path: - The user breaks down the design into compute pages which roughly correspond to 64-BLB blocks. - The user writes the behavioral code for each compute page and specifies page inputs/outputs via class variables. - The user encloses the behavioral code for each compute page in the cycle() method of a class that is subclassed from ComputePage. The functional specification for the cycle() method is that, every time it is called, it should correspond to 1 clock cycle of the array. - The user provides a method, isRunnable(), which, based on compute page state, determines whether the compute page is currently runnable. (You can imagine that, in a real system, this could be code that could be run when a page is swapped out; examining the state being swapped out on its way to a CMB). This could be used if there is pipelined data still in the page, but there is no input to the page. - The user then subclasses the StreamDesign class and provides a constructor that instantiates all of the compute pages (which are declared as class variables) and all of the stream buffers (which are declared as class variables). The constructor must also connect together the compute pages with stream buffers. - Inputs should be specified by instantiating InputPage pages. (which are declared as class variables). - Outputs should be specified by instantiating OutputPage pages. (which are declared as class variables). - At this point, the user can instantiate the subclassed StreamDesign class representing the design and run the runDesign() method (which is inherited). This will: - determine the dataflow graph (as described by the compute pages and stream buffers). - place this graph into a suitable data structure. - each node represents a compute page (and references the actual compute page). - each node has an array of input/output references to source/sink nodes (of pages). - each input/output array of references to other nodes has a corresponding array of references to stream buffers so that the scheduler can check the status of those buffers before scheduling those pages. - create an array that points to all of the compute pages. - create an array that points to all of the stream buffers. - pass the design, along with the dataflow graph, array of compute pages and array of stream buffers, to the scheduler class. - call the runSchedule() method on the scheduler class. - To see if a design is completed, the user should poll the design with the method isRunning(). - The user may submit other jobs to the scheduler by calling runDesign() on that design when desired. - The user should (must) call cycle() on the scheduler to have it simulate a cycle. - The scheduler is essentially a static class: StreamScheduler. Therefore it does not have to be instantiated. Parameters to tweek: - Page swap time: - Cycles for swapping out state. - Cycles for swapping in config. - Cycles for swapping in state. - Inter-page buffer sizes. - Number of physical compute pages present (super-page). - Number of clock cycles for a signal from one compute page location to reach another compute page location (maybe different for each pair). - Number of clock cycles for the scheduler to make scheduling decisions: - whether a page should be swapped in/out. - which page to swap in/out. Things measuring: - design completion time vs. above parameters. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From eylon@cs.berkeley.edu Mon Nov 9 19:15:56 1998 Date: Mon, 9 Nov 1998 19:15:55 -0800 (PST) From: Eylon Caspi Reply-To: Eylon Caspi To: Randy Huang , Michael Chu cc: Andre' DeHon Subject: [CS262] 11/9/98 meeting notes Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 4 -------- Restricted data-flow model: (1) compute-pages communicate using tokens (data + time-stamp doublets) through LIFO stream buffers (2) a page "steps" on a fixed-size set of input tokens (stepping-set), i.e. consumes inputs at fixed relative rates (fixed ratios). - e.g. step consumes 2 of input A plus 1 of input B - in general, a page may have data-dependent relative input rates, e.g. data on input A determines consumption rate of input B. - conjecture: any design with data-dependent relative input rates can be partitioned into blocks with fixed relative input rates - alternate terminology: input-rate = consumption rate relative input-rate = consumption rate ratio - Q: stepping-set may be "empty?" (for pages which generate output without needing inputs) (3) a "step" may produce arbitrarily many output tokens, i.e. output-rate may be data-dependent. - e.g. run-length decoder -------- Specifying / Partitioning Applications: (1) Partition app into communicating pages: - specify data-flow - no control-flow (no asynchronous signals, no conditionally-executed code, mux small if-then-else results) - local state allowed - stall at page granularity (no pipeline bubbles) (2) Data flow graph (DFG) specifies for each compute-page: - pages providing input - pages receiving output - relative input consumption rates - minimum time between steps (i.e. max throughput) (previous 2 may be combined by specifying input rates in cycles?) - relative output production rates (if known, relative to input arrivals or cycles) (if not known, scheduler may benefit from approximation, e.g. expected rate, or upper/lower bound) (3) Behavioral description encapsulated in per-page "step()" method: - called when: (i) page is resident (ii) stepping-set of inputs is available (iii) page is ready for step (by throughput spec) (iv) no output-buffer is full - produces arbitrarily-many output tokens, time-stamped arbitrarily-far into the future - interface to streams via: pop(): istream --> token push(): token --> ostream -------- Simulator: (sample implementation) (1) each page has input buffers & output buffers (see (5) below) (2) maintain collective contents of all output buffers in central priority queue, ordered by token time-stamp (3) step time by pulling head of priority queue, delivering it into receiving pages' input queues, and calling step() for pages which are ready (4) throughput cap can be handled by re-time-stamping tokens to a later time as they are visited in the priority queue - suffices to do this only for last token of a page's stepping set - better to retime in receiver's input queue? (when to wake?) (5) alternate object view: stream-buffer objects contain both page-output and page-input sides (complete LIFO implementation; (2)-(4) still hold) -------- Scheduler: (1) Job: schedules + places compute-pages for execution on array - fixed # of physical pages available - need notion of placement, routing delay (2) Scheduling policy to be determined. e.g.: - time-sliced (timer interrupt) - hints from input/output rates - metrics to optimize? Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689 From mmchu@screech.CS.Berkeley.EDU Tue Nov 10 15:43 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id PAA17823; Tue, 10 Nov 1998 15:43:38 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id PAA07828; Tue, 10 Nov 1998 15:43:36 -0800 (PST) Date: Tue, 10 Nov 1998 15:43:36 -0800 (PST) From: Michael Chu To: brewer@cs.berkeley.edu cc: Randy Huang , Eylon Caspi , Michael Chu Subject: [CS262] Progress Email. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 2198 Status: RO X-Status: X-Keywords: X-UID: 5 CS262 Project Progress Report - Fall 1998 Scheduling for Virtualized FPGA Hardware Eylon Caspi (eylon@cs) Michael Chu (mmchu@cs) Randy Huang (rhuang@cs) November 10, 1998 ############################################################################### Original Promised Progress for November 10, 1998 ------------------------------------------------ Quoted from http://www.cs.berkeley.edu/~eylon/cs262/proposal.txt: - Study + refine (preliminary) HSRA runtime model - Read about resource-allocation + scheduling for: * producer-consumer problems (esp. on multiprocessors) * data-flow architectures - Propose performance metrics + scheduling policies tailored to them - Choose benchmarks & think about implementation / page-partitioning - Build behavioral simulator infrastructure (start on it?) ############################################################################### Actual Progress by November 10, 1998 ------------------------------------ - Extensively discussed the behavioral simulator/scheduler infrastructure: - how to specify a design (what is the model the user views when they create a compute page). - model/restrictions (what restrictions are we putting on the general model, and whether those restrictions are reasonable). - Chose 3 apps/kernels which will be used to benchmark the system: - ADPCM. (most statically scheduled) - Wavelet Encoding. - Gouraud-shaded Triangle Rasterization. (most dynamically scheduled) - Read papers on scheduling of static/dynamic dataflow graphs. ############################################################################### Explicit Goals For Final Project -------------------------------- - Build the actual infrastructure necessary for the simulator/scheduler. - Implement at least one scheduling algorithm for scheduling multiple multi-compute-page designs. - Map the above given 3 apps/kernels onto the infrastructure and run them. - Vary runtime parameters (i.e. page swap cost, number of physical pages, etc.) to see how that affects system performance. ############################################################################### From rhuang@cs.berkeley.edu Tue Nov 10 17:08 PST 1998 Received: from agate.berkeley.edu (agate.Berkeley.EDU [128.32.155.1]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id RAA20803; Tue, 10 Nov 1998 17:08:06 -0800 (PST) Received: from cs.berkeley.edu (rustle.CS.Berkeley.EDU [128.32.131.215]) by agate.berkeley.edu (8.8.5/8.8.5) with ESMTP id RAA17473; Tue, 10 Nov 1998 17:08:05 -0800 (PST) Sender: rhuang@agate.berkeley.edu Message-ID: <3648E375.430CCFF5@cs.berkeley.edu> Date: Tue, 10 Nov 1998 17:08:05 -0800 From: Randy Huang Organization: UC Berkeley Computer Science Division X-Mailer: Mozilla 4.07 [en] (X11; I; SunOS 5.5.1 sun4u) MIME-Version: 1.0 To: Michael Chu CC: amd@cs.berkeley.edu, Randy Huang , Eylon Caspi Subject: notes from 11-10 meeting References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Content-Length: 1120 Status: RO X-Status: X-Keywords: X-UID: 6 1. should we use/modify an existing system vs starting from scratch? Answer: start from scratch 2. When specifying the input rate, we also need to specify an absolutely rate (at least for one of the input). This is to deal with the fact that there might be feedback in the compute page and that it is not always possible to receive inputs on every clock cycle 3. We need to explicitly specify dataflow above page level. This way, the scheduler does not need to extract this information later on. For example: connect ("P1.o", "P2.a") 4. The idea of super-buffer (coined by Michael d:o) is that buffer needs to change its size depending on its current state. For example, it might need to split into by when directed by the scheduler or it might need to reduce its size when both pages connected to it are co-resident. 5. Dynamic hardware replication has to be done by the scheduler, not the mapper since only the scheduler would know how much space is available on the hardware and therefore how much replication is allowed. 6. Come up with a style (how to code the compute page) for tomorrow's meeting (2-3:30). From mmchu@screech.CS.Berkeley.EDU Wed Nov 11 17:05 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id RAA01219; Wed, 11 Nov 1998 17:05:24 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id RAA11674; Wed, 11 Nov 1998 17:05:23 -0800 (PST) Date: Wed, 11 Nov 1998 17:05:22 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Next meeting (Tomorrow, 3:30pm->5pm). Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1675 Status: RO X-Status: X-Keywords: X-UID: 7 Randy, We are planning to meet tomorrow 3:30pm->5pm. Accomplishments today: - went through general implementation requirements for ADPCM. - figured out roughly number of pages required. - discovered, that due to way adaptive nature of the algorithm, there is a loop back from the output back to the filter. This means that, the filter/predictor can process at most 1 token, and must wait for the result of that token before it can process another token (easier explained in person). - went through my example of a compute page implementation. - questions: Is this sufficient? For eylon, this is sufficient since all of his compute pages just take in 1 token and generate 1 token. My triange rasterizer pages have control signals passed around which makes it more interesting. Randy, just to remind you, we are assuming that you will take on understanding the WAVELET kernel example and partitioning it into pages. Just wanted to make sure that was on your queue. Please see if you have any other comments on the compute page model. It would be nice to nail down some interface tomorrow and start coding up the kernels in it... I would say that the interface does not have to be in stone... if we later on decide to add another method call to specify another parameter, that would be fine... but, it is important that we don't brainstorm too much longer... Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@screech.CS.Berkeley.EDU Fri Nov 13 15:11 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id PAA11647; Fri, 13 Nov 1998 15:11:51 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id PAA18265; Fri, 13 Nov 1998 15:11:50 -0800 (PST) Date: Fri, 13 Nov 1998 15:11:49 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Notes Nov. 12, 1998 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1431 Status: RO X-Status: X-Keywords: X-UID: 8 - This weekend, we will each work on implementing our given kernels in the compute page model. - Next week, we will begin work on the simulator (and possibly any work to make sure our coding interface for the pages is consistent between us). - Simulator todo: - superclasses: - ComputePage - Design - PageInput, PageOutput - helper methods. - How does a page notify scheduler it is "done"? - each compute page is assumed to have a designated BLB which can signal on a "done" wire. - in the code, this is done by calling a signalDone() method in the step() method of each page. - each page is responsible for signaling "done" for itself. - if you need to propagate "done" through your pipeline, this is done in an application-specific manner (page input/output). - all scheduler knows about is individual page "done" signals. - state of a page stored in scheduler ("done", running, etc.). - Meet again Monday. - Let's meet after the BRASS meeting. I can meet up until about 3:30pm since I have a doctor's appointment at 4:00pm and need to make sure I can catch the appropriate buses (2 buses) to get down to about Telegraph/Ashby. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@screech.CS.Berkeley.EDU Mon Nov 16 20:18 PST 1998 Received: from ribbit.CS.Berkeley.EDU (ribbit.CS.Berkeley.EDU [128.32.131.152]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id UAA29636; Mon, 16 Nov 1998 20:18:35 -0800 (PST) Received: from localhost by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.1) with SMTP id UAA15365; Mon, 16 Nov 1998 20:19:22 -0800 (PST) X-Authentication-Warning: ribbit.CS.Berkeley.EDU: mmchu owned process doing -bs Date: Mon, 16 Nov 1998 20:19:22 -0800 (PST) From: Michael Chu X-Sender: mmchu@ribbit.CS.Berkeley.EDU Reply-To: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Meeting Time. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1960 Status: RO X-Status: X-Keywords: X-UID: 9 Hi guys, Since we didn't get a chance to meet today, let's setup a meeting time for tomorrow. So, I know you both meet with Andre' tomorrow, so, if I remember correctly, the time we can meet is 3:30-5:00pm? If this is a problem, email immediately. Also, I have my design coded up... It is in: ~mmchu/class/cs262/project/trianglerasterizer TriangleRasterizer.java is the "Design" class. It instantiates 3 "ColorPage" pages, one each for each of the R/G/B color values. Each of these "ColorPage" pages outputs the next color value and then increments to the next value, continually every clock cycle until N values have been output. Each page keeps track of how many N it is supposed to output (set at instantiation time) and then asserts the signalDone() signal when it is finished. The colors are then merged using a "MergePage" page which takes the outputs of the "ColorPage" pages and then writes it into the CMB (packed) attached to the subarray. This essentially means that there are 4 pages per scanline of a triangle. The one problem with this is that, since I am specializing the hardware for every single triangle, every single triangle needs to have its special configuration loaded in... this means, currently, at least 1000-3000 cycles PER TRIANGLE! Now, the solution to this is either: 1) Decrease the amount of time spent during configuration. 2) Move more of the computation onto the array. I will look into #2... The one problem is that, I have only implemented part of the rasterizer in hardware for CS294-6... some of the setup code is currently still in Java-only... Anyways... I will look into it... Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@ribbit.CS.Berkeley.EDU Mon Nov 16 22:46 PST 1998 Received: from localhost by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.1) with SMTP id WAA16519; Mon, 16 Nov 1998 22:46:08 -0800 (PST) Date: Mon, 16 Nov 1998 22:46:08 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Triangle Rasterizeration. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1929 Status: RO X-Status: X-Keywords: X-UID: 10 After thinking a little about it, I realized/remembered that the reason Randy and I originally thought that instantiating a new design for every triangle was that, to do otherwise would require us to change our model restriction (since we must then be able to support "select" inputs... essentially dynamic rate inputs)... Instead, my current method is purely static input consumption rate (and actually, static output consumption rate). Also, another problem is that, to fully implement the design in hardware (from verticies/colors -> rasterization) would require multipliers/dividers... These multipliers/dividers may be rather large.... (also, I don't have experience in designing good dividers so am not quite sure what a reasonable BLB count would be)... I could probably always just have them take multiple cycles (just have it iterate instead of fully spatial)... this might be important to guarantee each operator (multiply/divide) does not go out of 1 page... I guess I'd like you guys' input on this... The first option makes the coding very simple but points out the problems with the reconfiguration time... The second option make coding very hard, necessitates dynamic input rates (more importantly, the ability to run a page an arbitrary amount of cycles before requiring an input stream buffer to be non-empty)... It would yield a more realistic result, but require much more coding on my part and the groups part (for the simulator, etc.)... I can explain in more detail tomorrow... My main concern is what is the best decision to get the classwork done. (classwork first and then, when there is not as much time pressure, full brass model). Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@screech.CS.Berkeley.EDU Wed Nov 18 17:51 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id RAA06345; Wed, 18 Nov 1998 17:51:05 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id RAA05480; Wed, 18 Nov 1998 17:51:04 -0800 (PST) Date: Wed, 18 Nov 1998 17:51:04 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: Next meeting. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 5093 Status: RO X-Status: X-Keywords: X-UID: 11 Guys, When did you guys want to meet next? I can meet Friday, but I assume Eylon has quantum to finish? (Eylon: confirm/deny). So, another way to look at it is, between now and the next meeting, what do we want to get done? I would say: - Each person code up the kernel (Completely... doesn't mean that it is the world's best or even the most logical implementation... just means that, if I had a simulator today, and ran your design, it should eventually dump out the correct answer... we can always go back and tweek the implementation later... but having AN implementation allows us the option to turn in if time runs outs...) - I would like to fill in the some of the skeleton for the runtime system libraries. This would entail adding any extra classes necessary so that the kernels can compile against them successfully... This DOES NOT MEAN THAT ALL THE CALLS TO THE LIBRARY ARE FUNCTIONAL (ESPECIALLY THE ONE'S THAT WILL CALL OUT TO THE KERNEL). This mainly has to do with some of the "convenience" things such as making sure all of your PageInput PageOutput objects are instantiated for a ComputePage before you enter the constructor (I believe I can automatically do this for you). This also includes adding in any of the other interface methods we had come up with during the meeting. - I would then, in addition, like us to each think about HOW the simulator is linked in to the scheduler... i.e. how the scheduler resolves (i.e. goes back and fixes) the descrepancies between the relative delays times of the data tokens the ComputePage specifies for its output values and the actual arrival times (after taking into consideration swapping of pages, etc.). So, I am most concerned with how we will actually IMPLEMENT (i.e. ideas on concrete Java code) since, being able to have the system run at all is a vital part of our project which I don't want to just casually believe we can handle... (I am still not 100% confident about how to do this... I hope to have some sample runtime system code written by Monday that will try to deal with this... even with a stupid/naive scheduler... just to prove that it is doable...) I assume that we really should make sure that, how the scheduler interfaces to the simulator and the actual scheduler decision theory should remain separate (i.e. scheduling algorithm changes SHOULD NOT adversily affect how the scheduler manipulates the simulator... all it should do is require that, at each scheduler decision point, it be given more/less/different information, but returns the same set of actions, i.e. promote the following stream buffer values to available, offset the following stream buffer values to be be delayed more in order to simulate a page being swapped out). So, in this bullet point, I am trying to think about separating the implementation from the scheduling theory so that we can be already running our designs through stupid simulators (to prove at least our designs are correct in construction) while we try to refine the scheduling algorithm... I think that would put us on a much more solid boat since this allows us to incrementally improve the scheduling algorithm, and still turn in what we have at the time if we run out of time... So, given the enormity (in my opinion) of what we might want to accomplish between now and the next meeting (because, without this, I think we will just end up wasting time arguing), why don't we meet on: MONDAY, NOVEMBER 23, 1998 (3:30->5:30pm) Agenda: - did anyone, while implementing their kernel, find that the current model (w/ restrictions) was too limiting? if so, what did you (unilaterally) add to our model so that your application could be implemented? (if 2 people added similar things, can we combine them in some fashion? or, is there an EASY way to get around it?) - if there were any changes to the model, what are the method calls that you need added to the runtime system library? - what was your idea on how the simulator and scheduler should interact? can your idea isolate scheduler changes from the simulator/scheduler interface? (the idea is that, we can drop in different schedulers as we improve our algorithm). - what should our next meeting goals be? (probably, try to develop some initial scheduling algorithms...) (also, to completely code in the simulator/ scheduler interface (not necessarily have a wonderful scheduler coded...)) - map out a concrete schedule (may change, but will at least help make us feel bad when we start seeing we don't make the progress we wanted... :-)) - anything else? (I think this will easily take the 2 HRs.). Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From eylon@cs.berkeley.edu Thu Nov 19 14:48:27 1998 Date: Thu, 19 Nov 1998 14:48:26 -0800 (PST) From: Eylon Caspi Reply-To: Eylon Caspi To: Randy Huang , Michael Chu cc: Andre' DeHon Subject: [CS262] mtg notes 11/17/98 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 12 Notes from 11/17/98 meeting: (Theme of today: explicit page I/O requests) - Triangle rasterizer was originally written with reconfig per triangle in order to avoid variable relative input rates (line-width input controls #iterations). - Alternate implementation with fixed relative input rates involves a controller which converts line-width input to a line-long sequence of control tokens (simple case: unary encode 4 --> 1111) [a.k.a IDF "repeat" operator] - Our original I/O model (still valid): Page fires on a fixed-size input set (fixed relative input rates) to produce a variable number of output tokens. Timing: (i) stall page on empty input buffer (ii) fire when input arrives (iii) stall page on full output buffer ==> requires explicit input & output requests by each page - In simulator (i) step() == fire (ii) queue() == output request (iii) exiting from step() == input request (don't call again until next input available) Note, step() as in (iii) is event-driven (call when input available) Equivalent to polling-loop which explicitly waits if input not available. - How does scheduler know that a page firing is done? (i) page stalled on empty input buffer (explicit input request) (ii) source pages (no inputs --> no stall on input) - Stalling on full output buffer: (in simulator) - ideally, each page would have behavioral code (step()) in a thread, suspended when stall on I/O request (empty input, full output). Seems to requires user thread scheduling to do accurately. ==> not available (i) for pages with inputs, suffices to let step() code generate time-stamped output tokens, possibly overfilling output buffer. Then simulator offsets token time stamps to account for page stalling. (ii) for source pages, never stall on inputs, so step() handler would run indefinitely. Instead, step() should explicitly check for full output buffer and exit if full. We need to nail new (?) conditions for when step() is called for such a page. - Time-stamped tokens ( (i) above ) require some explanation. Idea is to allow page to compute + enqueue long-latency outputs in the same step() call as the associated inputs. For pipelined, feed-fwd designs, these outputs will not be ready in HW until some subsequent firing. Emitting them from behavioral code as soon as they can be computed saves work -- don't need to code up pipeline registers as explicit page state. Futuristic time-stamps on outputs may be wrong if page stalls before the true firing for those outputs. (i) simulator must be able to order and re-time-stamp output tokens. (ii) simulator must know how long each firing really takes, i.e. when page HW can take next input set (time is less than output latency for pipelined designs). Either: - static firing time (can be specified in page construction) - dynamic firing time (must be specified in each step()) Scheduler may benefit from statically-specified min/max/ave firing times, even if firing time is dynamic in general - Ideas for triangle rasterizer: - multiple parallel rasterizers: [rasterizer] mem --> demux ---> [rasterizer] --> mux --> mem [rasterizer] (i) each rasterizer does a line of same triangle. each rasterized line must be padded to same width (triangle width or image width?) * multiple rows referenced per cycle --> miss RAM row-buffer (ii) each rasterizer does a column of same triangle, so group produces (part of) a row in lock step * produce 1 row at a time --> fewer row-buffer misses (iii) [11/18/98] each rasterizer does lines in 1 region of image * no interference in RAM * can parallelize demuxers (cross-bar to rasterizers) * demux can filter triangles by height, only send to appropriate rasterizer(s) - Scheduler vs. Simulator: (i) Scheduler decides what configs run in what pages, + when to reconfig (ii) Simulator fires+stalls resident pages (HW specific) - Explicit I/O requests seem to require hardware stream buffers + page stalls (e.g. accordioning data for back-pressure). NO SUCH SUPPORT exists on Trumpet test-chip: can only stall entire array, & only when waiting for a CMB (e.g. miss row buffer). - ADPCM update: (i) Andre suggests pre-calculating multiple-paths then selecting based on critical-path control lines (ii) increases design area, shrinks latency (was ~35-slow, now ~16-slow) Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689 From amd@growl.CS.Berkeley.EDU Thu Nov 19 19:53 PST 1998 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id TAA17089; Thu, 19 Nov 1998 19:53:41 -0800 (PST) Received: from growl.CS.Berkeley.EDU (growl.CS.Berkeley.EDU [128.32.35.152]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id TAA20446; Thu, 19 Nov 1998 19:53:12 -0800 (PST) Received: (from amd@localhost) by growl.CS.Berkeley.EDU (8.8.3/8.6.9) id TAA24142; Thu, 19 Nov 1998 19:53:39 -0800 (PST) Date: Thu, 19 Nov 1998 19:53:39 -0800 (PST) Message-Id: <199811200353.TAA24142@growl.CS.Berkeley.EDU> To: eylon@CS.Berkeley.EDU CC: rhuang@CS.Berkeley.EDU, mmchu@CS.Berkeley.EDU In-reply-to: (message from Eylon Caspi on Thu, 19 Nov 1998 14:48:26 -0800 (PST)) Subject: Thoughts on scheduling From: "Andre' DeHon" Reply-To: "Andre' DeHon" Organization: UC Berkeley BRASS X-face: zXx,|&]HuS;w?NU71cDc=Q}P`'1 JvDDtN[*09(DjFxt<&rg',Nz"|5#Bv<,H8?e"Y]S>8V=F4&wP6NO/! |H~lYV'++i9#G`?qx$0G^i|+M|WNt[yF5W#*>%X[NW0" Content-Type: text Content-Length: 2595 Status: RO X-Status: X-Keywords: X-UID: 13 I haven't heard a clear description of the scheduling strategies under consideration, yet, so I thought I'd toss in a few thoughts on trying to sort it out. It seems like there are several axes here. (1) how much information you have on the application runtime characteristics none --> estimates from prior runs --> online stats for this run --> a posteri stats. from this exact run --> full token generation sequence trace from this run... [hmm, good question, is the full count of tokens produced by each operator in response to each input enough information to search for the best feasible a posteri schedule? ] (2) runtime (re)placement is performed * "static" placement only (for target array size) * "startup/creation" placement * (re)placement at every step (1) assume free [ref.] (2) assign cost * develop migration strategy (assess when should move) (1) a posteri [ref] (2) previous run stats (3) opportunistic / work stealing e.g. 1st pref. to running a page where run before if fine a idling (none of it's virtual pages enabled) load new virtual page code, and run [pref. to unloaded, then loaded elsewhere and enabled...] (a) once start loading, committed (b) can abandon/suspend if one of its pages becomes enabled (4) maybe run work stealing until converge, then use data/placement from that for future runs? (3) operation gang scheduling ? how do we take advantage of direct/spatial links * independently schedule everything (i.e. never) * gang schedule only cycles (probably the real initial starting point) * "static" or "startup/creation" selection of blocks to gang schedule together * greedy gang schedule, but allow to devolve? * pick gang-schedule candidates from runtime data (1) a posteri from this run [ref.] (2) previous run or aggregate stats (3) adapt during run Andre' From eylon@cs.berkeley.edu Mon Nov 23 22:53:33 1998 Date: Mon, 23 Nov 1998 22:53:31 -0800 (PST) From: Eylon Caspi To: Michael Chu , Randy Huang Subject: [CS262] forward time-stamped tokens Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 14 A few more words about forward time-stamps in the HSRA behavioral simulator (to appease the nay-sayers). The problem: - page behavior == step() firing routine - step() == input event handler, produces outputs - problem: stalling step() on output-stream overflow is difficult; requires threads or convoluted code solution: don't stall step() - problem: keeping internal state for a pipelined page is cumbersome solution: don't keep state - both problems are solved by immediate, "speculative" emission of output tokens, to be retimed automatically later. Tokens as Events: - each step() is fired at a well-defined time - step() outputs tokens as soon as computed, giving latency/delay from firing time: outputPage.queue( value, fwd_time_offset ); - enqueued tokens have delivery time-stamp: delivery_time = firing_time + compute_latency + network_latency - i.e. enqueued tokens are events, totally-ordered in time. simulator just picks the soonest token to deliver / fire-on. - note, a page cannot fire until a complete set of input tokens arrive, which may be well after delivery-time of the 1st token. Retiming Future Tokens: - forward output tokens may not be delivered at stamped delivery-time, if: (i) output stream is full --> page stalls --> tokens not output (ii) emitting page swaps out - need a way to "retime" future tokens for such delays (job of simulator + stream buffer) - enqueued tokens exist in 1 of 3 states: (1) future, possibly overflowing buffer (speculative) (2) future, guaranteed buffer space (enqueued) (3) delivered (ready to be dequeued) maybe refine (3) to: (3) enroute, (4) delivered? - normal operation: (i) enqueued token enters stream buffer in state (2) (stream buffer stores forward and present tokens) (ii) token "matures" to state (3) when the current time passes its time-stamp (delivery time) (iii) token is dequeued at destination - stream-buffer overflow: (i) if an output stream-buffer has free space when step() enqueues a forward output, then that space is guaranteed to exist when the token matures ==> token begins in state (2) (ii) if an output stream-buffer is full (of forward or present tokens) when step() enqueues an output, then buffer space is not guaranteed for that token when it matures. Page may or may not stall at that time, i.e. output is speculative ==> token begins in state (1) (iii) only state (2) tokens can be delivered, i.e. mature to state (3) (iv) state (1) speculative token upgrades to state (2) when buffer space becomes available. e.g. token enqueued state (1) into full buffer at time N with time stamp T; a state (3) token is dequeued at time N+D, freeing a slot; then the speculative token is upgraded to state (2) with new time stamp T+D. - page swap: (i) state (1) & (2) tokens have not yet been injected into the net, so they stall with the producer page. Tokens must be retimed when producer is reconnected to a live destination, keeping same token state (ii) state (3) tokens are on the net, so they drain to the consumer (no explicit retiming?) e.g. producer page is swapped out at time N, having a state (1) (or (2)) token time-stamped T1, plus a state (3) token time- stamped T3. The T3 token is drained to consumer, which is still live. The producer is swapped in at time N+D, whereupon the T1 token is retimed to T1+D, still in state (1) (or (2)). Q: what is proper network draining procedure? e.g. consumer has 1 token delivered, but needs 2 before it can consume. If disconnect producer & consumer, must save the 1 delivered token at the consumer. How to inject it back into the stream buffer later? On a different topic... CMBs in "special" pages: - how should a CMB be represented in our infrastructure? should be hidden behind stream buffer - random access: page supplies address token to read, address+data tokens to write - stream: CMB contains & increments own pointer, value & stride set by microprocessor (workload module) - CMB can be embedded in a stream buffer (i) 2-way connected stream: connects 2 compute pages (ii) 1-way connected stream: 1 end is CMB storage (iii) but how to reverse read/write direction for next computation? - CMB can be embedded in a special page (i) reads/writes a java array (ii) page must be pinned (not swapped out - expensive!) (iii) but how to reverse read/write direction for next computation? Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689 From rhuang@cs.berkeley.edu Tue Nov 24 08:16 PST 1998 Received: from agate.berkeley.edu (agate.Berkeley.EDU [128.32.155.1]) by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.1) with ESMTP id IAA22986 for ; Tue, 24 Nov 1998 08:16:49 -0800 (PST) Received: from cs.berkeley.edu (rustle.CS.Berkeley.EDU [128.32.131.215]) by agate.berkeley.edu (8.8.5/8.8.5) with ESMTP id IAA22124; Tue, 24 Nov 1998 08:15:58 -0800 (PST) Sender: rhuang@agate.berkeley.edu Message-ID: <365ADBBE.31994D0@cs.berkeley.edu> Date: Tue, 24 Nov 1998 08:15:58 -0800 From: Randy Huang Organization: UC Berkeley Computer Science Division X-Mailer: Mozilla 4.07 [en] (X11; I; SunOS 5.5.1 sun4u) MIME-Version: 1.0 To: Michael Chu CC: rhuang@cs.berkeley.edu, Eylon Caspi , @cs.berkeley.edu Subject: notes References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Content-Length: 570 Status: RO X-Status: A X-Keywords: X-UID: 15 Nov 18, 1998 1) Eylon reduced adpcm latency from 35 slow to 16 slow at the expense of more hardware 2) compute page class need stream width 3) max wires in and our of CP is 160 (c = 10, p = 0.67) 4) need to be more realistic in programming 5) every CMB has Local Address Register and Bound register, controlled by processor 6) no nimble mode for CMB, but CMB is byte addressable 7) future CMB hardware upgrade to support stream 8) hardware support cycle counters 9) How does apps handle CMB full? 10) arch support needed to "allocate" memory to some boundary From amd@growl.CS.Berkeley.EDU Tue Nov 24 21:36 PST 1998 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id VAA12716; Tue, 24 Nov 1998 21:36:25 -0800 (PST) Received: from growl.CS.Berkeley.EDU (growl.CS.Berkeley.EDU [128.32.35.152]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id VAA03052; Tue, 24 Nov 1998 21:36:24 -0800 (PST) Received: (from amd@localhost) by growl.CS.Berkeley.EDU (8.8.3/8.6.9) id VAA07398; Tue, 24 Nov 1998 21:36:23 -0800 (PST) Date: Tue, 24 Nov 1998 21:36:23 -0800 (PST) Message-Id: <199811250536.VAA07398@growl.CS.Berkeley.EDU> To: rhuang@CS.Berkeley.EDU, eylon@CS.Berkeley.EDU, mmchu@CS.Berkeley.EDU Cc: amd@CS.Berkeley.EDU Subject: deadlock -- to bubble or not to bubble From: "Andre' DeHon" Reply-To: "Andre' DeHon" Organization: UC Berkeley BRASS X-face: zXx,|&]HuS;w?NU71cDc=Q}P`'1 JvDDtN[*09(DjFxt<&rg',Nz"|5#Bv<,H8?e"Y]S>8V=F4&wP6NO/! |H~lYV'++i9#G`?qx$0G^i|+M|WNt[yF5W#*>%X[NW0" Content-Type: text Content-Length: 1996 Status: RO X-Status: X-Keywords: X-UID: 16 Eylon and I wandered into an interesting question today. If we take the default operational mode we've been assuming (run page for one cycle only when inputs available), does that create deadlock possibilities which did not exist in the original token flow model? For a cyclic design where the cycle is equal to the latency through the page, this is the only thing we can do. However, for a non-cyclic design (and, to some extent, a cyclic design where the cycle length is smaller than the input->output token latency), we may be changing the semantics from the original graph. e.g. if we had a feed forward page with N cycles of delay, what happens when it is given a single input token (or token set)? In the token flow semantics, an output token can be produced N cycles later. In our current operating mode, no output can really be produced until N more input token sets arive. Now, if the original graph depended on single token flowing through w/out more token coming in...we may now create a deadlock situation that did not exist before. Now, maybe this can't happen unless the pages are really contained in some macro feedback loop? or maybe it just exacrabates buffer size issues. ...or, maybe it means we have to take some care in analyzing subgraphs to see if we've introduced any such problems (but I fear this gets close to the halting problem). It may mean we have to support bubble-through pages. One idea that seems like it might work (though maybe not that desirable) is to have each primitive page either support bubble-through or cyclic dependence mode. We cut the graph into purely cyclic or purely forward pages. Then we run each page in the appropriate mode. My guess it this won't be a problem for your designs (as long as you're careful about end-of-stream endpoint conditions), but we need to think through this in general and figure out how we need to handle it (in partitioning, in page logic, in hardware....). Andre' From amd@growl.CS.Berkeley.EDU Tue Nov 24 22:09 PST 1998 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id WAA13363; Tue, 24 Nov 1998 22:09:34 -0800 (PST) Received: from growl.CS.Berkeley.EDU (growl.CS.Berkeley.EDU [128.32.35.152]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id WAA03519; Tue, 24 Nov 1998 22:09:33 -0800 (PST) Received: (from amd@localhost) by growl.CS.Berkeley.EDU (8.8.3/8.6.9) id WAA07416; Tue, 24 Nov 1998 22:09:32 -0800 (PST) Date: Tue, 24 Nov 1998 22:09:32 -0800 (PST) Message-Id: <199811250609.WAA07416@growl.CS.Berkeley.EDU> To: eylon@CS.Berkeley.EDU, rhuang@CS.Berkeley.EDU, mmchu@CS.Berkeley.EDU, amd@CS.Berkeley.EDU Subject: possibly not a deadlock issue, just a buffer size issue? From: "Andre' DeHon" Reply-To: "Andre' DeHon" Organization: UC Berkeley BRASS X-face: zXx,|&]HuS;w?NU71cDc=Q}P`'1 JvDDtN[*09(DjFxt<&rg',Nz"|5#Bv<,H8?e"Y]S>8V=F4&wP6NO/! |H~lYV'++i9#G`?qx$0G^i|+M|WNt[yF5W#*>%X[NW0" Content-Type: text Content-Length: 1005 Status: RO X-Status: X-Keywords: X-UID: 17 In Buck94 thesis, he notes (but is only citing results from elsewhere so proof isn't here) that marked graphs is live iff the number of tokens on each cycle is at least one. We maintain that property by our transformation of the graph into pages (we can model the atomic firing of a page as an implicit feedback arc equal to the pipeline depth of the page, producing one token into the feedback arc on each firing). He also notes that a live marking is "safe" iff every place (input buffer) is in a cycle and every cycle has exactly one token. I suspect we don't meet that (multiple tokens in each cycle), but "safeness" simply means we can get away with fixed length buffers. So, the implication may be (again) on guaranting buffer size bounds rather than on inherent deadlock. Unfortunately, his primary reference for this theory is 1972 technical report from "Massachusetts Computer Associates" ... ...so I feel a little better about this. Definitely need to dig further... andre' From mmchu@ribbit.CS.Berkeley.EDU Sun Nov 29 22:23 PST 1998 Received: from localhost by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.1) with SMTP id WAA25088; Sun, 29 Nov 1998 22:22:59 -0800 (PST) Date: Sun, 29 Nov 1998 22:22:59 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Notes 11/23/98 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1654 Status: RO X-Status: X-Keywords: X-UID: 18 - Important coding style changes: - all pages must instantiate PageInput/PageOutput variables themselves. PageInput(width, perStep); PageOutput(width); - howSlow() -> betweenFirings() - Eylon has gratiously agreed (or was coerced into doing by Randy/Michael) to do/code up the initial simulator. - Separately with Randy/Michael: - The interface from the scheduler and simulator will be: Sim.run(ComputePage[], NUMCYCLES); ^- w/ placement information. - it will return when either: - the number of cycles has been run. - some other situation occurs: buffer full/empty/page done? - The scheduler needs to examine: - simulation timestep. - buffers full?/empty? - buffer exact fullness? - which pages are done? - which pages have sufficient inputs to proceed. - Assume: - each compute page is preloaded (config/state) into 1 or more CMBs (warm start) (to avoid initial context load cost). - must explicitly tell simulator that we are moving a page (for cost). - simulator can do a diff on the compute pages given to it to see if a page is swapped in/out (for cost accounting). - the simulator must account for moving config/state data - Personal note: CMB must be settable to be addressable in different memory sizes (1 byte -> 8 bytes?) - NOTE: Many of the ideas in these notes will be superceded by future notes! ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@ribbit.CS.Berkeley.EDU Sun Nov 29 22:38 PST 1998 Received: from localhost by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.1) with SMTP id WAA25609; Sun, 29 Nov 1998 22:37:59 -0800 (PST) Date: Sun, 29 Nov 1998 22:37:59 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Notes 11/24/98 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1024 Status: RO X-Status: X-Keywords: X-UID: 19 NOTE: This meeting was between Eylon/Michael. - The Scheduler/Simulator interface should be the following (roughly): load_config(ComputePage) dump_config(ComputePage) dump_state(ComputePage) drain_net() get_stats() do_steps() or simulate() CMB_to_CMB(CMB, CMB); PRIM_to_CMB(int[], CMB); CMB_to_PRIM(CMB, int[]); - We will have stylized step()s, each step will consume 1 input set. - We should have exact interrupts: run steps to completion and then ignore the outputs that are over the time slice boundary. - Questions to answer: - How to drain the network on disconnect? How to drain the network on reconnect? - When does the simulator collect page stats? (thru net) and report to scheduler? - Scheduler only gets upcalled after a timeslice. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@ribbit.CS.Berkeley.EDU Sun Nov 29 23:07 PST 1998 Received: from localhost by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.1) with SMTP id XAA26280; Sun, 29 Nov 1998 23:07:02 -0800 (PST) Date: Sun, 29 Nov 1998 23:07:02 -0800 (PST) From: Michael Chu Reply-To: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Notes. 11/25/98 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1203 Status: RO X-Status: X-Keywords: X-UID: 20 - The signalDone() method will take a # of cycles argument as to how far in the future the done signal should occur. - Output must be queued internally for a pipelined page. What this means is that, internally for a page, you have to have an array you advance yourself and don't queue them up immediately... Therefore, for a pipelined page (i.e. pipelined adder) you will always be queuing up tokens as latency 1. - As a consequence, you need to preload the registers in a compute page and also the stream buffers (also postpad the end of stream so the done signal can make it through). Simulator (FPGA) - event loop: token maturation done signals timer events microprocessor interrupts context_load() state_load() state_store() mm_cmb_cmb() mm_prm_cmb() mm_cmb_prm() startpage() stoppage() Microprocessor (scheduler) - event loop workload done signal timer array interrupt Simulator upcalls to microprocessor. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From eylon@cs.berkeley.edu Mon Nov 30 12:13:36 1998 Date: Mon, 30 Nov 1998 12:13:35 -0800 (PST) From: Eylon Caspi To: Randy Huang , Michael Chu cc: Andre' DeHon Subject: [CS262] meeting notes 11/29/98 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 21 (1) On time-sliced operation - interaction between processor & array possible on several time-scales: (i) on every page stall (too fine grained) (ii) on page stalls requiring swapping in non-resident page (e.g. read past end of CMB memory?) (alternatively, after fill/drain fraction (f) of a CMB) (iii) after N firings (requires HW counters & array->processor interrupts) (iv) on timer interrupt (non-pre-emptive: swap out stalled pages only) ( pre-emptive: swap out active pages too) - cost assumptions: (i) array page context swaps are "very" expensive (state dump, state load, context load, ~1000 cycles) (--> Q: test chip needs 1300 cycles? 3000?) (ii) communication with processor is expensive (e.g. bubbling page stats & commands up/down the array network, + to/from processor, ~100 cycles?) (iii) inter-page communication is cheap (after configure) (<~20 cycles latency?) (iv) page firings are short (<~10 cycles) - page context swaps + processor communication are expensive, MUST be infrequent - time-sliced operation (iv) seems best - arguments against informing processor immediately when a page stalls: (i) increases frequency of array-processor communication and page context swaps (probabilistically) (ii) page context swap stalls an array subtree (below some switch), may side-effect other running pages - time-slices work well for: (i) coresident computation (ii) partially-resident feed-forward computation through CMBs - time-slices waste time for: (i) partially-resident, feedback-dominated computations (require swap multiple times in each feedback cycle, so spend most of time-slice stalled) BUT such computations are necessarily swamped by context-swap time, bad candidates for FPGA, better on processor. (2) List scheduling - appropriate because: (i) resource-constrained scheduling algorithm (ii) does static or dynamic scheduling (iii) popular for multi-processor scheduling - on every scheduling decision (e.g. on time-slice), run pages which are: (i) runnable (ii) top priority - differences from multi-processor: (i) MP has "task" precedence; array pages have firing precedence (i.e. data precedence) (ii) MP assumes cheap context swaps; expensive on array - scheduling page firings (in analogy to MP tasks) is too fine grained; will schedule runs of firings in time-slice - page priority depends on: (i) amount of input available from CMB (#tokens or memory size) (ii) space left in output CMB (#tokens or memory size) (iii) number of successors (iv) number of tokens produced per firing ((iii)+(iv) are important w.r.t. subsequent fanout from CMB) (v) firing latency (vi) distance to farthest data-flow consumer (#intermediate pages) (vii) size of feedback cycle(s) containing this page (how big a cycle fits coresident on array) (viii) what else? (3) Page clustering (still raw) - feedback cycles which fit in the array should run coresident (feedback cycles which do not fit in the array are hopeless) - for each runnable page, need to know: (i) all data-flow cycles containing it (ii) size (#pages) and delay (#firings) in each cycle (iii) which cycles (subcycle?) fit on array - eviction policy? priority function will express how important it is to evict (pre-empt) running pages to make room for an entire coresident cycle Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689 From mmchu@screech.CS.Berkeley.EDU Mon Nov 30 16:57 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id QAA17912; Mon, 30 Nov 1998 16:57:31 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id QAA06595; Mon, 30 Nov 1998 16:57:30 -0800 (PST) Date: Mon, 30 Nov 1998 16:57:29 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] Schedule of remaining days. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 4382 Status: RO X-Status: X-Keywords: X-UID: 22 After tomorrow, I am putting 100% brain power into CS262 (no more CS294-6!) So, the MOST important thing to think about now is, where do we want to be on EACH of the next 10 days left... Tomorrow, we should definitely set down a REAL schedule so we know immediately if we fall behind (not that we aren't already...) Also, it helps us end arguments quicker (since, we will know we have to make certain decision by certain amount of time, or else...) So, here are the days left, and what I think should be some of the deadlines (these are what I believe should be done by the time we leave Soda that day): Tuesday, December 1: The user interface (for apps) is fully nailed down. Also, what kind of interface/etc. the simulator and scheduler want out of the library components must be 100% nailed down. I will code up the library and make sure those interfaces are provided. Also, the interface between sched/sim should be 100% nailed down. Wednesday, December 2: All apps should be coded up completely in the new standard! They should compile/link totally. In addition, some sort of priority scheme (at this time, we no longer have the time to find the most correct one, but one that we believe will have a fighting chance of working...) should have been decided on. Thursday, December 3: Eylon will be temporarily not with us due to quantum. Randy and I will start coding up the scheduler as a list scheduler given the priority method determined at the time. Friday, December 4: Randy and I will have a scheduler completely coded that compiles/links against the library (and the simulator stubs). Saturday, December 5: Apps should be fully functionally tested to guarantee that they work! (so we are not later still determining whether it is the sim/sched or the app that is malfunctioning). Sunday, December 6: The simulator should be coded up initially! Monday, December 7: We should have a running app/simulator/scheduler program UP! At this point, ALL WORK WILL ONLY BE IN DEBUGGING! (and, once it is debugged, if we have time, think about minor tweeks to make the scheduling better). Tuesday, December 8: Debugging work! / Optimizing work! Wednesday, December 9: We should really start collecting stats and generating charts for the poster! We can, of course, continue optimizing work, but, it must be done in a SEPARATE directory, so that, if it breaks ANYTHING, we can simply fall back to our current data! Thursday, December 10: Project had better be done (in one form or another)! Or we are dead! Also, please let the group know if ANYTHING is planned for any of the days from now until December 11. I know Eylon has 1 last quantum homework set due on Friday so he may be partially out of commision end of thursday/friday (that's why I have it so that Randy and I are doing most of the deadline work those days). So, for me, I need (mentally) 1 day of relax on Friday (I want to try to leave Soda by 5:00pm that day) so that I can unwind. But, beyond that, I am willing to work around the clock on this. Also, I propose that, on Friday, we all take the rest of that day off and unwind again. Then, come back in, full force, Saturday/Sunday/Monday to finish off the paper. (the paper needs to be done by Monday night since Tuesday morning, I am going to be on a plane to Las Vegas). Of course, if, by some miracle, we are AHEAD of schedule, we should endeavor to keep working hard so that we remain ahead of schedule (since, it would be to all of our advantages to not pull an all nighter on Thursday, otherwise, we will be catatonic on Friday when Brewer comes around). Please, mail me your comments. (Just so the entire group knows, I will be dedicated my time tonight, Monday night, to working on CS294-6 since, Tuesday morning is my last obligation for that class.) Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@screech.CS.Berkeley.EDU Tue Dec 1 18:22 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id SAA27770; Tue, 1 Dec 1998 18:22:47 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id SAA10187; Tue, 1 Dec 1998 18:22:46 -0800 (PST) Date: Tue, 1 Dec 1998 18:22:46 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] December 1, 1998 Todos. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 4229 Status: RO X-Status: X-Keywords: X-UID: 23 So, here are the things we had agreed to do Dec. 1, 1998 for Dec. 2, 1998: 1) Change any random access memory reference in your application to, instead, use the CMBInterfaceRAM class. If you look in Eylon's file: ~eylon/cs262/project/simulator.java, assume that, where it says Stream, it means PageInput/PageOutput (where appropriate) so that the connect() method still works correctly. Also, I would say that the input and output names should be: addr, dataIn, dataOut, rwMode. When you queue up a token for rwMode from your compute page, queue it up as either: CMB.rwModeRead or CMB.rwModeWrite. The addressing granularity is implied through the width of the datain/out streams. 2) If you have any sequential memory access (like, your design is outputting tokens that should just be stored sequentially in memory), use the CMBInterfaceSeq class. I assume that there should some kind of interface that allows you to set this... Perhaps, it is something that is setup in the constructor? (for now, let's just say it is a constructor like: CMBInterfaceSeq(int baseaddr, int stride, int rwMode); We will change those specific pieces of code later when this is resolved (don't worry, it will easily show up due to compiler syntax checking). Also, assume that a CMBInterfaceSeq has PageInput of dataIn, and PageOutput of dataOut (I know that is not what is in there, but that is what should be there). 3) Currently, my understanding is that, if you need a CMB as RAM, you instantiate it as either CMBInterfaceRAM or CMBInterfaceSeq, and then , in your design (of type Design), you use the standard connect() call (connect(SRCPAGE, "SRCPAGEOUTPUT", SINKPAGE, "SINKPAGEOUTPUT")) to connect the addr/dataIn/dataOut/rwMode PageInputs/Outputs. 4) In your designs, if you have a pipelined page, you should use the TokenQueue class and, first, queue it up with the appropriate amount of dummy data to fill your pipeline. Then, on every step() call, you will dequeue() 1 token to queue it on your PageOutput, and you will calculate 1+ tokens and queue them up on the TokenQueue queue. 5) If you have a pipelined page in an interpage feedback loop, you must ensure that inputs are preloaded correctly to avoid deadlock. Remember that a page will not be fired (step() called) until it has sufficient # of inputs preloaded on its input streams to make sure there is no deadlock. 6) After a done signal occurs in an input stream (application-specific), first, the signalDone(int delay) should be called with the appropriate delay (in cycles). Also, all of the inputs (including the done input stream), must be postpadded sufficiently to allow the done signal to propagate through the page before the done signal actually registers (once a done signal is registered, then no more output tokens should come out of the page). So, these the things we should all get done today (before tomorrow)... So please make sure you go through and modify your code as such... modify it this way first, and, if there are any changes, at least all of our code will be consistently changed... (of course, Eylon, I'd say, if you can look at the simulator first, that'd be your top priority since tomorrow, I definitely want to get the interface down pat and plan to make sure Randy and my code are 100% linkable against the library... Before I leave tomorrow, my goal is to make sure Randy and my kernels compile, link, and also, the base library code sets up as much of the state as it can... also, Randy, tomorrow, I'd also like for us to work on the scheduler, in the sense of having a paper version of the scheduler so Thursday, we will be coding it up... so, let's plan on that...)... (oh, implicit in this Eylon, is that, you don't have to have your kernel up to spec tomorrow, simulator and thinking out what you need from the library is more important...)... Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@hiss.CS.Berkeley.EDU Thu Dec 3 01:54:01 1998 Return-Path: mmchu@hiss.CS.Berkeley.EDU Received: from localhost (mmchu@localhost) by hiss.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id BAA02029; Thu, 3 Dec 1998 01:53:57 -0800 (PST) Date: Thu, 3 Dec 1998 01:53:57 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] December 2, 1998 (Lunch meeting notes). Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 995 Status: RO X-Status: X-Keywords: X-UID: 24 NOTE: Only Eylon and I were attending. - Our first pass view of a CMB is one where, there is an arbitrary split of the 2 MBits into a singla data section and a context area which can hold several config/state of pages. The number of configurations that can be cached can be a "knob". - Since CMB can be filled up quickly by an output per cycle compute page, then, timeslices may have to be short in order to prevent pages sitting idle for most of the time slice. - we may want to support bit-wide streams in hardware (bit granularity addressing (not for RAM)). - Eylon talked with Andre' and Andre' believes that we should statically determine a feedback loop and cluster those together and schedule as an atomic unit. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@hiss.CS.Berkeley.EDU Thu Dec 3 02:15:13 1998 Return-Path: mmchu@hiss.CS.Berkeley.EDU Received: from localhost (mmchu@localhost) by hiss.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id CAA02066; Thu, 3 Dec 1998 02:15:09 -0800 (PST) Date: Thu, 3 Dec 1998 02:15:08 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [CS262] December 2, 1998 (Evening meeting). Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 2127 Status: RO X-Status: X-Keywords: X-UID: 25 NOTE: Most of this meeting was with Randy and I, and Eylon came in afterwards. - Flowcharts of simulator. 1) After being upcalled by simulator, we check for any resident pages in the HSRA which can/should be freed. 2) Is the ready queue empty? - yes? - is the array empty? - yes: exit. - no: return to sim. - no? go to 3) 3) Are there any free page frames? - no? return to sim. - yes? go to 4) 4) Job selection: - first, place any virtual compute page that already have their configuration loaded in a free physical compute page. - then pick from top of ready Q and do random placement until we fill the free list or the ready Q is empty. 5) Return to sim. - Signals from the array: - CP: done, exception, specialization, stall (time, pageposition, inttype(sigdone,sigstalled,sigtimer)) - CMB: addr fault (also used to detect config/state load/dump done). - using official amd placement system 10 11 14 15 8 9 12 13 2 3 6 7 0 1 4 5 - Design: ComputePage[] getComputePages(); - ComputePage: boolean isSchedulable(); int stallGiveUpTime(); - Simulator: PageFrame[] getPageFrames(); void intMask(int mask); SRC DST load CMB STATE CMB dump CP CONFIG PM PM? - Scheduler: int schedule(EventQueue); ComputePage[] getStalledPages(); void freePages(ComputePages[]); ReadyQueue buildReadyQueue(); ------------------------------------------------------------------------ We moved the EventQueue from the parameter to a class variable! int doSchedule() { get queue tokens. make determinations (issue simulator commands). store our actions onto our scheduleQ (list of things that we requested of the simulator but not yet verified). return(ESTIMATED TIME TAKEN) } We get the batch of interrupts up until the up call... any interrupts that occur during the call are batched for next time. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@screech.CS.Berkeley.EDU Thu Dec 3 13:40 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id NAA06564; Thu, 3 Dec 1998 13:40:54 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id NAA17227; Thu, 3 Dec 1998 13:40:53 -0800 (PST) Date: Thu, 3 Dec 1998 13:40:53 -0800 (PST) From: Michael Chu Reply-To: Michael Chu To: Randy Huang cc: Michael Chu , Eylon Caspi Subject: [CS262] Today Meeting. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1957 Status: RO X-Status: X-Keywords: X-UID: 26 Randy, Let's meet at 4:00pm and get the first-pass scheduler coded up. I think that should be our goal tonight before we go (of course, we will not be able to actually test it since the simulator is in progress, but we can do our best up to that point). So, here is what I think our game plan should be the next few days: - Today: code up the Scheduler. If Eylon doesn't have his Simulator.java file checked out, we can make sure all the stub interfaces are there for us to compile against. - Tomorrow: let's start talking about different ways of assigning priorities (what parameters, etc.). Also, different ways of determining who to kick out when. (I want to leave campus at 5pm so I can take a mental breather before next week's hell week starts...) - Saturday: as a goal, let's have a writeup of a more sophisticated priority/scheduling scheme by the end of the day (something we might be not ashamed to turn in if it came down to that...) also, we should code up something that - Sunday: this should be the major TESTING day to make sure interfaces and EVERYTHING WORKS! This includes the actual functionality of the kernels. The simulator should be up by today! So, this is essentially the day we make sure we even have a running system. - Monday->Tuesday: By the end of this, we should have: - 1) a working system (with some kind of scheduling). - 2) coded up the more sophisticated scheduling. - Wednesday: We should probably copy our current working directory out to a separate directory and start running data simulations on it to get numbers for graphs. Simultaneously, we can investigate other algorithms (but will always have old data to fall back on... I found this helpful in CS294-6 to always have backup data to fall back on...) - Thursday: Let's wrap it up! Posters, data, etc. Michael. From rhuang@cs.berkeley.edu Tue Dec 8 14:34 PST 1998 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id OAA05560; Tue, 8 Dec 1998 14:34:09 -0800 (PST) Received: from agate.berkeley.edu (agate.Berkeley.EDU [128.32.155.1]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id OAA01595; Tue, 8 Dec 1998 14:34:09 -0800 (PST) Received: from cs.berkeley.edu (hiss.CS.Berkeley.EDU [128.32.131.214]) by agate.berkeley.edu (8.8.5/8.8.5) with ESMTP id OAA16440; Tue, 8 Dec 1998 14:34:08 -0800 (PST) Sender: rhuang@agate.berkeley.edu Message-ID: <366DA960.2E95A2F6@cs.berkeley.edu> Date: Tue, 08 Dec 1998 14:34:08 -0800 From: Randy Huang Organization: UC Berkeley Computer Science Division X-Mailer: Mozilla 4.07 [en] (X11; I; SunOS 5.5.1 sun4u) MIME-Version: 1.0 To: eylon@cs.berkeley.edu, mmchu@cs.berkeley.edu Subject: cs262 poster stuff Content-Transfer-Encoding: 7bit Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii Content-Length: 1814 Status: RO X-Status: X-Keywords: X-UID: 27 CS262 Project Poster Title: Scheduling on a Reconfigurable Processor with Virtualized Hardware Class: CS262 People: Eylon Caspi, Michael Chu, Randy Huang Project ERL: http://www.cs.berkeley.edu/~eylon/cs262/ Part of BRASS Project: http://www.cs.berkeley.edu/projects/brass/ Purpose and problem what's a reconfigurable processor what's virtualized hardware why scheduling on a reconfigurable processor with virtualized hardware Methodology (of this study) propose a programming model write applications using this programming model triangle rasterization: # of CP, how slow, cycle adpcm: # of CP, how slow, cycle wavelet transform: # of CP, how slow, cycle build a simulator event simulator... build a scheduler centralized schedule running on the microprocessor what are the invariants to make sure jobs will finish run different scheduling algorithms understand architecture tradeoff Scheduling Algorithm build readyQ priority list 0. path-based a. not sorted (base case) b. sort by context resident c. sort by CMB resident d. sort of DFS e. clustering f. combination job selection Architecture Tradeoffs on chip, off chip configuration/state time scheduling cost # of configuration inside CMB data size # of frames Result graphs, tables of results of different scheduling algorithm effects of turning different architecture tradeoffs Conclusion Future Work take upper level switch box into cost model memory management References From mmchu@screech.CS.Berkeley.EDU Wed Dec 9 07:20 PST 1998 Received: from screech.CS.Berkeley.EDU (screech.CS.Berkeley.EDU [128.32.36.94]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id HAA29134; Wed, 9 Dec 1998 07:20:25 -0800 (PST) Received: from localhost (mmchu@localhost) by screech.CS.Berkeley.EDU (8.8.3/8.6.9) with SMTP id HAA10089; Wed, 9 Dec 1998 07:20:24 -0800 (PST) Date: Wed, 9 Dec 1998 07:20:23 -0800 (PST) From: Michael Chu To: Randy Huang cc: Michael Chu , Eylon Caspi Subject: [CS262] Status. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1654 Status: RO X-Status: X-Keywords: X-UID: 28 Randy, So here's our status: 1) Eylon and I have completed coding... (I believe that, when you read this mail, we will have syntax checked it via javac). 2) I wrote 2 simple tests: /project/cs/brass/a/tools/brass/develop/cs262_fa98/tests SingleComputePage_noIO SingleComputePage I haven't had a chance to compile against the library yet. They should be pretty self explanatory. 3) I started doing some of the conversion of my app to the correct style. Then I realized it was more important to create simple tests (since, very likely, we are not going to be running triangle rasterizer... could be wrong, but let's think a little smaller now...) So, I suspect that, unless you are adventurous, you will not be running the simulation tests alone... therefore, I would suggest you do the following: 1) Integrate your additional optimizations into Scheduler.java (I have it checked in). 2) Write some more basic tests (to test basic ideas). Also, write up synthetic benchmarks on the chance that we don't get to run our apps. 3) Continue work on poster/paper. I will try to be in 1pm. If you and Eylon are in Soda together earlier than that, then by all means go ahead and start full testing/debugging. Randy should roughly know what my code does (although I suspect that I would need to be there for debugging my code...) Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@boom.CS.Berkeley.EDU Sun Dec 13 05:58 PST 1998 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.0/) with ESMTP id FAA00044; Sun, 13 Dec 1998 05:58:33 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id FAA05994; Sun, 13 Dec 1998 05:58:32 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id FAA14792; Sun, 13 Dec 1998 05:58:31 -0800 (PST) Date: Sun, 13 Dec 1998 05:58:31 -0800 (PST) From: Michael Chu To: Randy Huang cc: Michael Chu , Eylon Caspi Subject: [CS262] Current Status. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1817 Status: RO X-Status: X-Keywords: X-UID: 29 Randy, Current status of the code is: 1) it currently works! I ran all of your tests (big/frame0/frame1/ frame2) and the results are in result.* 2) obviously, if you find something else wrong, I would recommend the procedure be: a) first try to work around it. b) if you cannot work around it, then report it. We just can't afford to keep continuing in this debug cycle if we are going to get the paper out (Randy cannot work on Monday and I cannot afford to be working past the afternoon on Monday since I have to prepare for my trip). So, Randy, when you come in, see if there are anything else you want to run... Eylon is currently changing CMB so I don't know if anything else will break... but, I have made a copy of the code in: ~mmchu/backup_121398_054843.tar.gz Then, if you can, see if you can get a jump start on paper writing so there isn't that wasted time between when you are up and when we are in (the good thing about our group is that, since we have such skewed working times that, we can actually have someone working on the project through most hours of the day... :-) That means, we should require less time... hopefully...) I will be going home in a little bit to sleep and will try to get here at 1pm (I will REALLY try for this since I want to try to maximize my overlap with Randy...)... Randy, try not to break the code... :-) (oh, BTW, just to let you know, I set the default for the CMB/CP ratio to 1 instead of 4... you have to explicitly change it for something else...). Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From eylon@cs.berkeley.edu Sun Dec 13 07:53:24 1998 Date: Sun, 13 Dec 1998 07:53:23 -0800 (PST) From: Eylon Caspi To: Randy Huang , Michael Chu Subject: [cs262] utilization counters Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 30 Howdy. I added utilization meters to the simulator. At the end of every time slice, it prints out a message like: === end time slice 24000: === 10 CPs fired 2830 cycles (28.3% on) === 5 CMBs fired 1396 cycles (27.92% on) === total utilization: 21.13% on This means that during the time slice (of 1000 cycles) ending at time 24000, 10 CP's were loaded, and fired for a collective total of 2830 cycles. That's 2830 cycles out of a collective 10,000 CP-cycles, i.e. 28.30% utilization in these loaded CPs. Similarly, 5 CMBs were loaded in this time slice, firing for a collective 1396 cycles, out of 5000 CMB-cycles, makes 27.92% utilization in these CMBs. Over the entire array, 15 pages fired for 2830+1396 cycles, out of the array's total 16*2*1000 page-cycles, for a total utilization of 21.13%. These rather optimistic numbers were generated by: java -mx100M RandomGraph -f 10 -n 20 -p 18 -e 2 -d 1000 -s 11 -t 1000 -simdebug -eventdebug -silence -t 1000 Note that this was the first useful time slice in which ANY pages fired. The first 23000 cycles were spent in context loads. So are the next 23000. The next time slice with any useful work produces the following: === end time slice 56000: === 10 CPs fired 114 cycles (1.14% on) === 8 CMBs fired 352 cycles (4.4% on) === total utilization: 2.33% on Utilization stays fairly abysmal from there on out. Surprisingly, increasing the number of frames does not help much. The scheduler will not co-schedule more than 19 CPs on the array, even if the array is large enough (e.g. numFrames=40). How do we fix that? To get these time statistics, use "-simdebug -eventdebug". -silence should speed things up some. grep for '^==='. This will also show a counter every 1024 firings on the array, to make sure that something is still active. Note, utilization is presently reported independently for each time slice. I don't keep running tallies yet, but that's easy to add. Sleep. Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689 From eylon@cs.berkeley.edu Wed Dec 16 09:50:36 1998 Date: Wed, 16 Dec 1998 09:50:36 -0800 (PST) From: Eylon Caspi To: brass@cs.berkeley.edu Subject: CS262 SCORE scheduling paper Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 31 The CS262 SCORE scheduling project, done by Michael, Randy, and Eylon, is complete. The project report and poster are available for public consumption at: http://www.cs.berkeley.edu/~eylon/cs262/ Take care. Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689 From mmchu@boom.CS.Berkeley.EDU Mon Jan 11 16:30 PST 1999 X-UIDL: 39857105e4d0d7ff8ec5f3d2a10350f9 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id QAA11152; Mon, 11 Jan 1999 16:30:38 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id QAA05937; Mon, 11 Jan 1999 16:30:33 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id QAA13300; Mon, 11 Jan 1999 16:30:32 -0800 (PST) Date: Mon, 11 Jan 1999 16:30:32 -0800 (PST) From: Michael Chu To: "Andre' DeHon" cc: Michael Chu , Randy Huang , Eylon Caspi Subject: Ideal curve. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1038 Status: RO X-Status: X-Keywords: X-UID: 32 Andre', When you said plot an ideal curve for array size vs. makespan (page 20 on the slides), did you mean, for each small/med/large design for each of the # of CP sizes, to figure out, if we statically scheduled the design, what the minimum makespan is? This would mean that, for each, we would have to take into account: - CMB availability. - reconfig time. - signal propagation time. (maybe not... since our datasize is relatively small). Also, would we need to model the timeslices for the ideal case? (since there may be cycles in the designs small/med/large). I am not sure if there is an easy way to do this (at least in the next day) for all of the data points. Ideas? Also, I assume that you were asking for the ideal curve only on the page 20 graph? Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From amd@growl.CS.Berkeley.EDU Mon Jan 11 17:15 PST 1999 X-UIDL: 5fbd88a4ba7349bdf60135d28961cf41 Received: from growl.CS.Berkeley.EDU (growl.CS.Berkeley.EDU [128.32.35.152]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id RAA12245; Mon, 11 Jan 1999 17:15:27 -0800 (PST) Received: (from amd@localhost) by growl.CS.Berkeley.EDU (8.8.3/8.6.9) id RAA09990; Mon, 11 Jan 1999 17:15:22 -0800 (PST) Date: Mon, 11 Jan 1999 17:15:22 -0800 (PST) Message-Id: <199901120115.RAA09990@growl.CS.Berkeley.EDU> To: mmchu@boom.CS.Berkeley.EDU CC: mmchu@boom.CS.Berkeley.EDU, rhuang@boom.CS.Berkeley.EDU, eylon@boom.CS.Berkeley.EDU, amd@CS.Berkeley.EDU In-reply-to: (message from Michael Chu on Mon, 11 Jan 1999 16:30:32 -0800 (PST)) Subject: Re: Ideal curve. From: "Andre' DeHon" Reply-To: "Andre' DeHon" Organization: UC Berkeley BRASS X-face: zXx,|&]HuS;w?NU71cDc=Q}P`'1 JvDDtN[*09(DjFxt<&rg',Nz"|5#Bv<,H8?e"Y]S>8V=F4&wP6NO/! |H~lYV'++i9#G`?qx$0G^i|+M|WNt[yF5W#*>%X[NW0" Content-Type: text Content-Length: 2332 Status: RO X-Status: X-Keywords: X-UID: 33 No. I mean *really* ideal -- if the problem can be solved with in 10^6 cycles with 14 pages, then (very naively) it can be solved in 2*10^6 w/ 7 pages 3*10^6 w/ 5 pages 4*10^6 w/ 4 pages 7*10^6 w/ 2 pages 14*10^6 w/ 1 page (actually, you'd usually work the other way, but I'm assuming your single page times are dominated by reconfiguration, so scaling down from there isn't reall the right thing. What I just did isn't really right either, because it may only take 10^6 cycles on 14 pages because of the makespan, so the ideal, single page version might be less than 14*10^6 pages...anyway, it's good enough for what I wanted for your presentation). The point is that there is an idea area*time=constant curve (so time = 1/area) assuming: * no swapping overhead * no resource (CMB) limits * no communication time * no dependencies limiting parallelism At some point I really would like to define tighter bounds (1) assuming swapping is free, so primarily/only have dependencies, what bound do we get? (2) maybe do something like you outline w/ static scheduling ...but not for the retreat. Can we get (1) in the stuff we're doing over the next month? Clear enough? Andre' > > Andre', > When you said plot an ideal curve for array size vs. makespan (page 20 > on the slides), did you mean, for each small/med/large design for each of > the # of CP sizes, to figure out, if we statically scheduled the design, > what the minimum makespan is? This would mean that, for each, we would > have to take into account: > - CMB availability. > - reconfig time. > - signal propagation time. (maybe not... since our datasize is > relatively small). > > Also, would we need to model the timeslices for the ideal case? (since > there may be cycles in the designs small/med/large). > > I am not sure if there is an easy way to do this (at least in the next > day) for all of the data points. Ideas? > > Also, I assume that you were asking for the ideal curve only on the page > 20 graph? > > Michael. > > ------------------------------------------------------------------------- > Michael Chu mmchu@cs.berkeley.edu > University of California, Berkeley http://www.cs.berkeley.edu/~mmchu > ------------------------------------------------------------------------- > > From timothyc@hum.CS.Berkeley.EDU Fri Jan 15 17:41 PST 1999 Received: from alpaca.eecs.berkeley.edu (alpaca.EECS.Berkeley.EDU [128.32.244.153]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id RAA19549 for ; Fri, 15 Jan 1999 17:41:48 -0800 (PST) Received: from hum.CS.Berkeley.EDU ([128.32.36.60]) by alpaca.eecs.berkeley.edu (8.9.1a/8.6.6.Beta11) with ESMTP id RAA26408; Fri, 15 Jan 1999 17:41:42 -0800 (PST) Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by hum.CS.Berkeley.EDU (8.8.3/8.8.2) with ESMTP id RAA07374 for ; Fri, 15 Jan 1999 17:41:38 -0800 (PST) Received: from growl.CS.Berkeley.EDU (growl.CS.Berkeley.EDU [128.32.35.152]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id RAA18916 for ; Fri, 15 Jan 1999 17:41:36 -0800 (PST) Received: (from amd@localhost) by growl.CS.Berkeley.EDU (8.8.3/8.6.9) id RAA19030; Fri, 15 Jan 1999 17:41:35 -0800 (PST) Date: Fri, 15 Jan 1999 17:41:35 -0800 (PST) Message-Id: <199901160141.RAA19030@growl.CS.Berkeley.EDU> To: brass@cs.berkeley.edu Subject: raw feedback From: "Andre' DeHon" Reply-To: "Andre' DeHon" Organization: UC Berkeley BRASS X-face: zXx,|&]HuS;w?NU71cDc=Q}P`'1 JvDDtN[*09(DjFxt<&rg',Nz"|5#Bv<,H8?e"Y]S>8V=F4&wP6NO/! |H~lYV'++i9#G`?qx$0G^i|+M|WNt[yF5W#*>%X[NW0" Content-Type: text Content-Length: 11555 Status: RO X-Status: X-Keywords: X-UID: 34 Here's the raw notes I took from the retreat feedback. We should have at least one other set from Randy/Joseph/Andreas. Andre' ------------------------------------------------------------------------ Steve Trimberger/Xilinx I'm seeing a lot of the same old faces want to see some graduate. Students, remember why you're here. Prof. help them graduate. Over the next year, I hope to see a lot of these people gone (they've succeeded at what they came to berkeley for). Focus is good. healthy, like to see that. Like John's single bin (corner turn) routing -- worth pursuing. Whoever picks that up... something that HSRA had (unique in FPGA field) is model/formalism behind it. What is the formalism behind this. Theory/why I need X bins to get a certain level of confidence (can be done in university, but not industry). Interesting area. Interested in seeing formal model. (maybe you could "prove" it's better than HSRA, rather than imply) SCORE scheduling Like to see (proof) can't introduce deadlock. If design doesn't have it, scheduling won't introduce it. What assumptions are necessary to achieve that. That's very valuable. Got comment, nice if FPGA not burn themselves up --- similarly, nice to know scheduler won't lock up. Trumpet Testchip Important at this point in the project to capture what you've learned. You've gone through a difficult experience getting a chip through this experience. It's important that learning doesn't get lost. Emboddied in a few individuals (war stories?) --- tech report ...what we learned (maybe not exciting technical stuff). Take the extra time and treat it as logic worth perservinging. Knowledge won w/ a lot os sweat. SCOREBoard ...focus Several different ways to go with it. It is real critical to focus on what is vitally important -- what is it you must have on this thing. Perhaps a lesson of testchip is to prune down what you have early --- just do that thing. Simpiliest thing that addresses needs. If you wind up building your own hwardware, be generous with the stuff there for debugging. Does like the SCORE direction/focus in particular. It is a vital piece of RC flow that needs to be addressed. As you proceed, doing the important thing well is very important. Doing the broader thing fair won't get you as much. ------------------------------------------------------------------------ Anthony Stansfield/HP Bristol several bits to making RC GP.. ...like Drake equation, several things to work out chip/density yield tools ...several levels... in intro thiks we got too focussed on one level and hurt others. So, now we're handling areas which have been understaffed in past. If you've decided to focus on one area because you think you understand others--->focus on that, get everyone together on this. think about error bars. chip error bar small (you know from design) score error bar is large, so need to beat that down. this feeds into board (board, huge area of virtex...). It's technologically fun, but is it fitting into the big picture. If you can say you need high performance for a particular application/demo...fine. If it's not, then maybe not. Decide what you want and how to get there. Idea of SCORE sounds good. Kind of thing other people w/ FPGAs haven't done. If you can do that everyone's going to love it. One thing wanted to ask: Is Score tied to a homogeneous model. (amd: NO). ...different things want different arrays/types... if build a board, how going to test it debug. testing hasn't been mentioned. If you want to make heavily integrated systems/commercial...how are you going to test it. Tester time is expensive (money). clarify: from a research standpoint, design things which are testable (easier to design up front than to add in / kludge later). Ian: mumble Steve's antifuse comment about lack of ability to test. ------------------------------------------------------------------------ Ian Eslick/SiSpice Everytime I come, I find it enjoyable (things see, ideas get). Lot of work is interesting. Fan of where we're going with SCORE. Personally, that's most unique contributions this group can provide over the next 9 months. This is woefully lacking elsewhere -- no one else think through formal models of how full ystems work. SCORE only/first effort to address this in a general way. Lot of work in HSRA. Like to see the results well characterized. Both engineering (switchbox cost...)...and practice. These are the things that are interesting .... make/break. Remember that with SCORE. Lots of details to work out. I can see this group going in one of 2 directions 1) 5-6 separate projects w/ little results 2) pull things together to show (system?) works Work though problem of vertically integrating all of these ideas. ...schedule, tools, board->chip ...other half of interesting research (engineeringn research) Board. --> encourage us to look at a piece of hardware to make things work on and commit to making that work. Challenge: don't just need a board designer---need a student who own's it. Otherwise integration is really tough. Burn time, won't get much out of it. Rather than being aggressive up front---focus up front on what want board to accomplish. go ahead and tune it, to some extent, to some application. You can't get it all right on the first time. Take an initiall application focus---more likely to support that than if try to build GP board. Also, make sure you have a win w/out the board. Simulators... Clarify: simulate board or IC? Ian: neither, learn something about key points about big ideas ...two tracks, be carefull about them. They can come together, but don't defeat each other. Resources limited: interested to see john's early work on this network continued. What does it mean. Is it like HSRA. How do we formalize/generalize... Looked at routing problem --- did we push complexity elsewhere? (placement, hardware) ------------------------------------------------------------------------ Jon Stockwood/Synopsys first retreat. first exposure to some of this know about tim's stuff from synopsys...big fan of that. Want to draw further conclusions from that before taking a more generic approach. Direction seems to be moving away from that now. Do we have some baseline results from that work (tim's) on what we can achieve. ...and what can we get out of this new approach. Like to see it more together: FLAME/BOOM Excited about new work. SCORE. Dynamic scheduling interesting. Concerned about overhead: scheduling decisions/dynamic routing/reconfiguration. What are the tradeoffs. If can stream through for a long time, can drowned out overhead. How long to make scheduling decisions. How route between them. Need a routing architecture where can quickly route between pages. Long talk with John. Similarities with John's and Atmel FPGA. John's is a couple of steps further from ATmel. His approach seems to make more sense than this H-Tree approach. If you have two functional units physically adjacent on array should be able to route through them quickly. Still think point about physical silicon layout ...seems like HSRA less regular. Each time go to larger size, need larger main H-bar. Interesting tradeoffs. Gut feeling John's work is more interesting. Openning present -- said GP platform. Seems now you're trying to attack DSP type applications. Not sure if that's where you're focus is lying. are you going that way. ditto other guys? very impressed. * like focus/balance * like components at all levels down to hardware, * interested in score, seem what comes out of that * HSRA arch...memory in compute blocks looks interesting ...hadn't seen presented before ------------------------------------------------------------------------ Scott Weber/EECS CAD took course this term improvement in tools during term, but still some problems like SCORE model Nick's work with ML shows a lot of promise (not a lot of ML users). ...Tim's work, his personal reasons. ------------------------------------------------------------------------ Ian ML -- may not be that ML is the answer, but that low level langauges put lots of artificial constriants on what you have to do. High level language, avoid that, but often harder to map down (eliminate other overheads). Maybe mapping problems are result of not knowing structure of your problem. As go higher, maybe see/get more structure. So, may be very interesting if/as you can show that high-level languages give you better way to design these system-on-a-chip scale systems. Customers don't care about up front costs. Can it get me to market faster. If HL langauges get you there better/faster it's fascinating. ------------------------------------------------------------------------ Trimberger Very impressed w/ quality of presentations this time. Clarify: was it the projector. Yes, impressed, with that....but that's not what I was talking about. The effort that went into preparing talks. It was easy to sit through. People hit main points, focus on the right stuff, came out well. Not as convinced as others about the value of a new langauge for expressing functionallity. Probably a philosophical issue. I understand value of a language....limitations. Some think better in letters, some in pictures. Often letters on paper people propose solutions assuming everyone thinks that way. Arguing against taking to extreme -- I have a way of expressing functionallity to--this is the way that functionallity must be expressed in the future because it is better. Don't make the mistake of assuming this will displace all other ways of expressing functionallity. Don't claim too much. Keep mind open. Ian: If going to do HL langauage, easy to focus on one area. map works well for ...., but real langauges have a mix of other nasty stuff. Don't get hung up on single thing... Good application: Simulated Anealing places (we tried -- not as fast as buying new computers) Story: Long time ago, when I was deep in CAE industry. New toolset. Tightly integrated (beautiful). Demo. Stop at some point. ask how to make a large number of objects meta-(list/lisp....). Steve: What if you have a designer who doesn't know list. I don't know anybody who doesn't know list. Anthony: I once worked for a company who tried to sell new hardware and new langauge at once --- very difficult sale to make. If show new hardware and here's C compiler -- easier for them to buy. Then, maybe you can sell them new langauge when they run into things they can't do in C. Trojan horse approach. if you're going to have several different tool approaches -- interoperability is important. john clarify: is that important for us, as researchers. If you want your work to be recognized and accepted by lots of people ,making it easier for them to use is important. ...want to do ultimate vs. popular. john/amd clarify: focus ian: technology transfer path? people and brains, papers, chips ...helps find focus. if want people to play with, this integration is important. If just want to show model powerful, paper ok --- maybe not need integration. ------------------------------------------------------------------------ From timothyc@hum.CS.Berkeley.EDU Fri Jan 15 20:23 PST 1999 Received: from alpaca.eecs.berkeley.edu (alpaca.EECS.Berkeley.EDU [128.32.244.153]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id UAA21820 for ; Fri, 15 Jan 1999 20:23:53 -0800 (PST) Received: from hum.CS.Berkeley.EDU ([128.32.36.60]) by alpaca.eecs.berkeley.edu (8.9.1a/8.6.6.Beta11) with ESMTP id UAA28788; Fri, 15 Jan 1999 20:23:47 -0800 (PST) Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by hum.CS.Berkeley.EDU (8.8.3/8.8.2) with ESMTP id UAA07394 for ; Fri, 15 Jan 1999 20:23:36 -0800 (PST) Received: from ICSI.Berkeley.EDU (fruitcake.ICSI.Berkeley.EDU [128.32.201.11]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id UAA20706 for ; Fri, 15 Jan 1999 20:23:35 -0800 (PST) Received: from ypocras.icsi.berkeley.edu (ypocras.ICSI.Berkeley.EDU [128.32.201.81]) by ICSI.Berkeley.EDU (8.9.0/8.9.0) with ESMTP id UAA00156 for ; Fri, 15 Jan 1999 20:23:30 -0800 (PST) Received: (akoch@localhost) by ypocras.icsi.berkeley.edu (8.8.2/1.8) id UAA04763 for brass@cs; Fri, 15 Jan 1999 20:23:33 -0800 (PST) Message-ID: <19990115202333.A4757@ypocras.ICSI.Berkeley.EDU> Date: Fri, 15 Jan 1999 20:23:33 -0800 From: Andreas Koch To: brass@cs.berkeley.edu Subject: More retreat notes (raw) Mime-Version: 1.0 X-Mailer: Mutt 0.93.2i Content-Type: text/plain; charset=us-ascii Content-Length: 6829 Status: RO X-Status: X-Keywords: X-UID: 35 Steve Trimberger - Xilinx * Seen old faces - get to know us - want to see us graduate - don't * lose focus on why you are here; professor needs to recognize this to * allow people to graduate; * Focus is good; start to see a theme; very healthy; like to see this * Like Hauser's single bent routing; worth pursuing; HSRA has a model; * unique in FPGA field; if someone is going to pick it up, has to ask * what is the formalism behind single-bent routing; need to find some * theory behind it; this is something that can be done in university, * not necessarily in the industry; interesting area; want to see * formal model; can it be proved that it is better than HSRA * Similar comment on score, scheduling; ask about deadlock; like to * see a proof about the scheduling algorithm is deadlock free; what is * the assumption to guarantee deadlock free; don't know what it is; * but it is valuable; industry can wave its hand on this problem, but * not university * Trumpet - important to capture what we have learned; project has * gone through a difficult learning experience; need to capture this * experience; what we have learned from actually gone through this * process; need to treat this as precious knowledge worth preserving; * Focus on scoreboard - several ways to go with the board; need to * focus on what we must have; learn the lessons from trumpet; prune * early; do the simpliest and critical thing; be generous on debugging * feature; it might not need vital (the debugging feature) * Particularly like the score direction/focus; vital piece of * reconfigurable computing; doing a particular thing well is important * vs doing a broad thing fair * Impressed with the qualities of the presentation; impressed with * projector; happy with the organization; everyone hits main points; * easy to shift through; * Is not as convinced as others in values in expressing functionality; * it's a philosophical belief; some people think better with * characters; someone think better with pictures; some what skeptical * especially a high-level language; arguing against "this is the way * functionality has to expressed because it is better"; cautioned * against "replacing existing methodology"; keep an open mind * Tell a story with the punch line, "I don't anyone who does not know * LISP" Jon Stockwood - Synopsys * First time at retreat * Big fan of Tim * Disappointed on cannot draw more conclusion before going to a * different conclusion; seem like brass is moving away from Tim's * approach; want to see baseline number, result, want to see more * tying up (boom and flame) * Admit it is kind of selfish since synopsys is adopting this approach * Like score * Like dynamic scheduling; concern about overheads (not just schedule, * but routing as well) * Imagine stream will run a long time, what is the cost tradeoff * Anthony Stansfield - Hewlett Packard * several thing need to address in making RC general purpose; chip; * fraction density; tools; apps to demonstrate; feel like we have * re-focus on area under resource in the past; it is important to * stick to our decision; does not go against scoreboard; went through * similar experience at HP; does not like a BIG scoreboard; brass * needs to think it through; make sure scoreboard contains what is * essential * like the idea of score; sounds like something no one is doing; if * brass can do it, everyone is going to love it; question is score to * tie to homogeneous array or heterogeneous; like to mix and match * array sizes; Answer: focus now on homogeneous * how to test scoreboard; testing has not been mention; brass seems to * assume things will just work; need to spend time on test strategies; * tester time is expensive; from the research stand point, design * things that are testable * From Ian: need to design processor cost efficient * Scheduling might need a routing architecture that supports these * runtime decisions * Something about atmel; Hauser approach might make more sense than * HSRA * Does not know about shortcuts, making some comments about things * logically close to each other will be physically close * Does not know about relocation property * Brass seems to be tackling traditional DSP application areas * Very impressed * Like the focus, balance, like vertical integration picture * Very interested in score work * Like HSRA architecture how it is able to load configuration in * parallel from CMB to Compute Page * Interoperability is important - ML to C; important if we want our * work to be "popular" Ian Eslick - Silicon Spice * enjoy talks and ideas; brass is doing interesting work * fan of score; want more work done; most unique contribution from * brass to the field; this field really lacks a model; people can * build boards, but do not have a model * like HSRA engineering data; for example, what did we not anticipate; * encourage to take HSRA lesson forward in score; * like to see brass projects tie together; making things all work; * working through vertical integration (hardware, cad, system). will * see traps that we will not usually see; do more engineering research * encourage scoreboard; need to make board realistic; need someone to * take charge; or else test time and integration will take too much * time * instead of being aggressive up front, focus on what do we want to * accomplish; can tune board to applications; we will get the board * wrong (memory balance, IO, how CPU is connected to array); * if doing a board, make sure there is a win by doing a simulator * making sure score can educate industry on key points; make two * tracks; making engineering effort to pull things together; make a * system work has tremendous values * want to see Hauser's work continues; still struggling for formalism; * need to generalize; work only in routing, might push complexities * into routing; how does this different from other approaches * wondering what high-level language will do about performance; see * opportunities in ML and high-level language will be able to capture * more structure * time to market is important; high-level might be able to enable * faster time to market * easy to focus in easy area when doing high-level language; need to * solve the whole program, not just kernel; use "map" expressing * mathematical equation as an example * methods of publicizing work helps to determine focus-do you want * people to play with it? Scott Weber * Like score model * Tools improve * Like ML, seems promising -- Andreas Koch Email : akoch@icsi.berkeley.edu International Computer Science Institute Phone : (510) 642-4274 Ext. 182 1947 Center Street, Suite 600 Phax : (510) 643-7684 Berkeley, CA 94704-1198, USA * PGP key available on request * From rhuang@hum.CS.Berkeley.EDU Sat Jan 16 16:59 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id QAA01165; Sat, 16 Jan 1999 16:59:22 -0800 (PST) Received: from hum.CS.Berkeley.EDU (hum.CS.Berkeley.EDU [128.32.36.60]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id QAA28631; Sat, 16 Jan 1999 16:59:15 -0800 (PST) Received: from ribbit.CS.Berkeley.EDU (ribbit.CS.Berkeley.EDU [128.32.131.152]) by hum.CS.Berkeley.EDU (8.8.3/8.8.2) with ESMTP id QAA08774 for ; Sat, 16 Jan 1999 16:59:12 -0800 (PST) Received: (from amd@localhost) by ribbit.CS.Berkeley.EDU (8.9.1a/8.9.1) id RAA03118; Sat, 16 Jan 1999 17:00:34 -0800 (PST) Date: Sat, 16 Jan 1999 17:00:34 -0800 (PST) Message-Id: <199901170100.RAA03118@ribbit.CS.Berkeley.EDU> To: score@hum.CS.Berkeley.EDU Subject: Notes From: "Andre' DeHon" Reply-To: "Andre' DeHon" Organization: UC Berkeley BRASS X-face: zXx,|&]HuS;w?NU71cDc=Q}P`'1 JvDDtN[*09(DjFxt<&rg',Nz"|5#Bv<,H8?e"Y]S>8V=F4&wP6NO/! |H~lYV'++i9#G`?qx$0G^i|+M|WNt[yF5W#*>%X[NW0" Content-Type: text Content-Length: 2143 Status: RO X-Status: X-Keywords: X-UID: 36 Enclosed are our notes on what we want to get out of the current 262 simulator in the next 4 weeks. We want to review this and come to consensus at Monday's score meeting. I linked this and my notes from the focus meeting into the new Score page. Randy/Michael, can you get your notes typed up and linked in soon? Nick, please link in your ML slides from the retreat. Andre' ------------------------------------------------------------------------ Question: What do we want to get out of the existing/CS262 simulator? Mostly want to make work and collect numbers on current implementation and scheduler. * debug simulator so doesn't crash/deadlock * make application run + incl. testing features not yet tested (e.g. repeater) * run larger datasets/ avoid timeslice noise * capture baseline ideal comparisons: + full ideal (zero reconfig/data movement time) + maybe infinite size CMB? * fix costs (which?) + upper level swbox reconfig (and effects on other CPs?) * see where time goes --> visualize bottlenecks * CMB->CP ratios (Rent-like estimation of requisite growth rate) ? Memory management of CMB * Sensitivity analysis + reconfig time + CMB size + scheduler runtime + ? on-chip<->main-memory bandwidth ? + timeslice (redo w/ decent size datasets and other things fixed) ? cycle clustering Who modify simulator? rhuang using eylon for consultation (eylon work on MS) Who evolve/write next simulator? rhuang 1st week -> debug current simulator (visualization) Can't answer w/ current: ? Reader/write on single CMB Deadlock: * limited number of CMBs (can identify staticly) * buffer overflow In general (not necessarily next two weeks) * scheduling work - priority queue - run speed / efficiency - clustering offline, online, cycles * memory management * cost model ------------------------------------------------------------------------ amd, rhuang, mmchu, (nweaver) Bus leaving retreat From mmchu@boom.CS.Berkeley.EDU Tue Jan 19 12:23 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id MAA17257; Tue, 19 Jan 1999 12:23:37 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id MAA18143; Tue, 19 Jan 1999 12:23:31 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id MAA18224; Tue, 19 Jan 1999 12:23:30 -0800 (PST) Date: Tue, 19 Jan 1999 12:23:30 -0800 (PST) From: Michael Chu To: Eylon Caspi , Randy Huang cc: Michael Chu Subject: [SCORE] Bug discovered (when unloading pages). Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1340 Status: RO X-Status: X-Keywords: X-UID: 37 I believe I have discovered the bug that was causing the simulator to complain when I was unloading pages... So, the problem is 2-fold: 1) The Scheduler, when it dumps a CMB to primary memory but does not reload a new CMB into that same frame, it assumes that CMB is now empty (functionally, to it, that CMB really is empty...) What this means is that it misses some opportunities to reuse the data in that CMB later on (pretty much, it may reload that CMB into another frame even though no one is using the frame the CMB used to be in)... Anyways... this is simply an efficiency issue (i.e. the Scheduler should be allowed to do this... nothing technically wrong with that...) 2) The Simulator, however, will not nullify CPs and CMBs in frames unless something overrides it... In this case, combined with the Scheduler inefficiency, causes it to try to unload a CMB which is really not in use (i.e. can be overwritten without checks...). I will try to solve those two problems this afternoon after my 1230-200 class. Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@boom.CS.Berkeley.EDU Tue Jan 19 18:26 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id SAA05525; Tue, 19 Jan 1999 18:26:25 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id SAA29544; Tue, 19 Jan 1999 18:26:24 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id SAA19319; Tue, 19 Jan 1999 18:26:23 -0800 (PST) Date: Tue, 19 Jan 1999 18:26:23 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi cc: Michael Chu Subject: [SCORE] Bugs solved/remaining. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1260 Status: RO X-Status: A X-Keywords: X-UID: 38 Hello all, So, today, I slayed 2 more bugs in the system (in the scheduler and simulator). Given the test cases I have now, the remaining bugs (at least that can be seen) are of the following types: 1) Deadlock (detectable). System just has nothing left to run. 2) Deadlock (not-detectable since it is oscillating). System is somehow flipflopping between 2 possible configs every timeslice. 3) A miscount during the jobSelection() trials. By the time I try to nail down the CMB placement, I run out of CMBs (which should not happen if I accounted for CMBs correctly.). 4) Somehow, a page has an event scheduled after it has scheduled done. My game plan is to tackle 3 and 4 first, hopefully reducing all bugs (that I know of), to 1 and 2. Then, figure out 2, and then figure out 1. I may not be able to do too much more tonight since I will be going to dinner and then sleeping (9:30am class... first one in a long time!) I will be in the office after 11am. Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@boom.CS.Berkeley.EDU Wed Jan 20 21:02 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id VAA08571; Wed, 20 Jan 1999 21:02:32 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id VAA23046; Wed, 20 Jan 1999 21:02:31 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id VAA23922; Wed, 20 Jan 1999 21:02:31 -0800 (PST) Date: Wed, 20 Jan 1999 21:02:31 -0800 (PST) From: Michael Chu Reply-To: Michael Chu To: Randy Huang , Eylon Caspi , "Andre' DeHon" cc: Michael Chu Subject: [SCORE] CS262 Bugs. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 2256 Status: RO X-Status: X-Keywords: X-UID: 39 Hey guys, I believe I have most of the bugs fixed. I will, of course, try to run more tests on it tomorrow to see if this is true. (In all, I have probably caught around 10 or so bugs in our code). I also added a flag to RandomGraph so that you can get the stats (utilization numbers). In general, you should then grep for: "Total cumulative" This should get you the total cumulative utilization across CP/CMB at each timeslice. So, if you increase the data size for feedforward, it usually increases the utilization quite a bit, even when the data size is below the timeslice. For non-feedforward graphs, it grows more slowly. Anyways, we knew this already, especially since we don't attempt to do clustering of feedback loops. So, some cavaets of our system: - it takes a bit of time to simulate data sizes of 10000+. We may have to optimize the simulator some (since, at sizes of 10000+, much of the time is spent in the simulator, just simulating, etc. - of course, I have not actually tested all aspects of the system; so stuff may break still (cross fingers!). In particular, the data dependent output rate stuff. - There are many scheduler inefficiencies which lead to higher makespans. They are on my list of things to do, but may not get all done this week (this week was for debugging). Anyways, if you get bored, please try to run the system and try to break it (messages about timeslices being too small and insufficient resources don't count unless you believe the message is in error!). Beware that, the larger datasizes on some graphs will appear to be doing nothing for quite a while between timeslices. The best way to gauge if it is doing something useful is to try grepping for the "Total cumulative" after the -stats option and seeing if timeslices happen (it may be many many minutes between timeslices... if it take more than an hour, let me know). All should be compiled in /project/cs/brass/a/tools/brass/develop/cs262_fa98 Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@boom.CS.Berkeley.EDU Thu Jan 21 11:03 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id LAA21864; Thu, 21 Jan 1999 11:03:09 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id LAA03237; Thu, 21 Jan 1999 11:03:08 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id LAA25666; Thu, 21 Jan 1999 11:03:07 -0800 (PST) Date: Thu, 21 Jan 1999 11:03:07 -0800 (PST) From: Michael Chu To: Randy Huang , Eylon Caspi , "Andre' DeHon" cc: Michael Chu Subject: [SCORE] New, slightly-improved 262 system. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 810 Status: RO X-Status: X-Keywords: X-UID: 40 I found 1 (actually it is 2 closely related bugs) bug in our system between last night and this morning. It had to do with the possibility of the system going into deadlock but not being able to detect it (I still should eventually improve my detection of stalled pages on the array and actually find some way to recheck their schedulability, but I still need to think whether this would be feasible on hardware...) Anyways, long story short, new verion in the directory, compiled and waiting... Please give it a test drive if you have time. Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@boom.CS.Berkeley.EDU Thu Jan 21 19:15 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id TAA07130; Thu, 21 Jan 1999 19:15:09 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id TAA14389; Thu, 21 Jan 1999 19:15:08 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id TAA26736; Thu, 21 Jan 1999 19:15:07 -0800 (PST) Date: Thu, 21 Jan 1999 19:15:07 -0800 (PST) From: Michael Chu To: Randy Huang cc: Michael Chu , Eylon Caspi Subject: CS262 - token time-of-flight Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1004 Status: RO X-Status: X-Keywords: X-UID: 41 Eylon says: A token's time-of-flight on a network stream is computed in networkRegsTo() in FramePosition.java: public int networkRegsTo (FramePosition p) { return 2 * registersPerLevel * crossoverDist(p); } registersPerLevel is presently the constant 1 but should really be a function of the crossover distance (which is the h-tree height of the cross-over switch between 2 communicating frames). e.g. time-of-flight is 0 cycles (1 cycles?) across the first-level cross-over. Time-of-flight might increase by quite a few cycles when comparing the 8th-level cross-over to the 7th. Also, CyclesToCrossArray should be computed from the same time-of-flight model, not specified arbitrarily (I guess that's in getCyclesToCrossArray()). ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@boom.CS.Berkeley.EDU Fri Jan 22 16:16 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id QAA01337; Fri, 22 Jan 1999 16:16:02 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id QAA02778; Fri, 22 Jan 1999 16:16:01 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id QAA28896; Fri, 22 Jan 1999 16:16:01 -0800 (PST) Date: Fri, 22 Jan 1999 16:16:01 -0800 (PST) From: Michael Chu To: Eylon Caspi , Randy Huang , "Andre' DeHon" cc: Michael Chu Subject: [SCORE] Even more improved CS262! Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1188 Status: RO X-Status: X-Keywords: X-UID: 42 So, yesterday, Eylon and I discovered another bug in the Simulator (stream back pressure bug). In addition, after I added the back pressure hole token to code, I discovered another bug in the simulator code (PageStop events were generated both in the simulator and the scheduler, but were not being suspended along with the page). Anyways, long story short, code has been fixed and I believe it to be working okay now... (it explained why sometimes, we were getting CPs firing for 10s of cycles in a 100000 cycle timeslice). So, the code is there (in the main directory) and compiled. There is one bug (where, I am off by 100x when calculating cumulative utilization since I misinterpretted the magnitude a previous utilization calculation was in... anyway, if you look at total cumulative utilization, just divide by 100). I will fix this bug when Randy checks in Simulator.java. Let me know if you have any questions. Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From mmchu@boom.CS.Berkeley.EDU Fri Jan 22 16:24 PST 1999 Received: from cs2.CS.Berkeley.EDU (cs2.CS.Berkeley.EDU [169.229.60.56]) by mnemosyne.CS.Berkeley.EDU (8.9.1a/) with ESMTP id QAA01521; Fri, 22 Jan 1999 16:24:10 -0800 (PST) Received: from boom.CS.Berkeley.EDU (boom.CS.Berkeley.EDU [128.32.131.183]) by cs2.CS.Berkeley.EDU (8.9.1a/8.6.6.Beta11) with ESMTP id QAA02939; Fri, 22 Jan 1999 16:24:09 -0800 (PST) Received: from localhost (mmchu@localhost) by boom.CS.Berkeley.EDU (8.8.3/8.8.2) with SMTP id QAA28900; Fri, 22 Jan 1999 16:24:08 -0800 (PST) Date: Fri, 22 Jan 1999 16:24:08 -0800 (PST) From: Michael Chu To: Eylon Caspi , Randy Huang cc: Michael Chu , "Andre' DeHon" Subject: [SCORE] Hole delivery question. Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Content-Length: 1355 Status: RO X-Status: X-Keywords: X-UID: 43 Eylon, I figured that, the way the simulator is, there was no reason to really keep track of the # of available holes in the stream. I just insert the hole event at the ending edge of PageFiring. Then, when the hole reached the producer page, the event handler first sees if there is space on the stream. If so, it will then just unstall the page (if is was stalling on this output stream). The tokenemit will then be able to detect the available space in the old was (by counting the sizes of the stream queues). So, this does assume 1 thing: that holes produced by a particular firing start traveling all together at the end of the page firing. This is not entirely true. It may have been produced at the beginning of the firing, the middle of the firing, etc. But, this would be dependent on the compute page itself. But, currently, in our behavioural description, we don't have notation for when, within a firing, input tokens are dequeued (i.e. dequeue @ 0 cycles after firing starts, etc.). We might want to add this notation so that we can more accurately model the hardware. Michael. ------------------------------------------------------------------------- Michael Chu mmchu@cs.berkeley.edu University of California, Berkeley http://www.cs.berkeley.edu/~mmchu ------------------------------------------------------------------------- From eylon@cs.berkeley.edu Fri Jan 22 16:49:31 1999 Date: Fri, 22 Jan 1999 16:49:30 -0800 (PST) From: Eylon Caspi To: Michael Chu cc: Randy Huang , Andre' DeHon Subject: Re: [SCORE] Hole delivery question. In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO X-Status: X-Keywords: X-UID: 44 On Fri, 22 Jan 1999, Michael Chu wrote: > Eylon, > I figured that, the way the simulator is, there was no reason to really > keep track of the # of available holes in the stream. I just insert the > hole event at the ending edge of PageFiring. Then, when the hole > reached the producer page, the event handler first sees if there is > space on the stream. If so, it will then just unstall the page (if is > was stalling on this output stream). The tokenemit will then be able to > detect the available space in the old was (by counting the sizes of the > stream queues). Hole arrival indicates space on the stream for at least 1 token. So one should not have to explicitly check for stream fullness. In fact, stream fullness is always the wrong heuristic (I think) since it does not reflect token/hole propagation times. I think you need to unstall a stalled-on-output page on hole arrival without qualification. Subsequently stalling on a second output will be handled automatically (in the normal way). > > So, this does assume 1 thing: > that holes produced by a particular firing start traveling all together at > the end of the page firing. This is not entirely true. It may have been > produced at the beginning of the firing, the middle of the firing, etc. > But, this would be dependent on the compute page itself. But, currently, > in our behavioural description, we don't have notation for when, within a > firing, input tokens are dequeued (i.e. dequeue @ 0 cycles after firing > starts, etc.). > > We might want to add this notation so that we can more accurately model > the hardware. > > Michael. True. In full generality, a page would have input-stall logic from which it could request inputs at any time. The input-stall logic must, however, know how many tokens are consumed per firing on each stream. So we could solve this by assuming that the full set of tokens is moved into secondary registers at the start of a firing, immediately freeing space on the stream. To go easier on hardware, "immediately" should probably be "1 per cycle." But then we are getting close to the fully general case above. Leaving timing to the user may be the best thing to do. Eylon Caspi University of California, Berkeley eylon@cs.berkeley.edu Electrical Engineering & Computer Science tel. 510-843-8689