1 00:00:01,760 --> 00:00:05,760 Hello and thank you for joining me virtually  today! I'm Azalea Raad and I will be talking to   2 00:00:05,760 --> 00:00:10,320 you about our joint work with Ori Lahav and Viktor  Vafeiadis on persistent Owicki-Gries reasoning.   3 00:00:11,360 --> 00:00:16,320 Let's start with a quick introduction to what  persistency is. Traditionally, computer storage   4 00:00:16,320 --> 00:00:21,120 is divided into two categories. On the one hand,  we have the RAM which is fast to access, to read   5 00:00:21,120 --> 00:00:26,000 and write, but it is volatile in the sense that  it loses its contents after a power failure or a   6 00:00:26,000 --> 00:00:31,040 crash. On the other hand, we have hard disk drives  which are slow to access but they're persistent   7 00:00:31,040 --> 00:00:37,360 and preserve their contents across power failures  and crashes. Non-volatile memory or NVM for short   8 00:00:37,360 --> 00:00:42,160 is a hybrid of these two worlds, bringing  you the best of both. NVM is fast like RAM   9 00:00:42,160 --> 00:00:48,080 but persistent like hard disk drives. In fact,  NVM or persistent memory has become so prevalent   10 00:00:48,080 --> 00:00:52,880 that recently mainstream architectures such as  Intel and ARM have extended their models with   11 00:00:52,880 --> 00:00:59,360 persistency features. In our recent work in POPL  2020, we formalised the persistency guarantees   12 00:00:59,360 --> 00:01:05,680 of the Intel-x86 architecture and developed the  model called Px86 for persistent x86, and the   13 00:01:05,680 --> 00:01:11,840 way we did this was by extending the original  x86 (TSO) model with persistency features.   14 00:01:13,120 --> 00:01:17,680 It is important for this talk for us to have  an intuitive understanding of what Px86 is   15 00:01:17,680 --> 00:01:24,240 which itself requires an understanding of the  original x86 model. As some of you would know,   16 00:01:24,240 --> 00:01:28,240 we can think of the x86 architecture  diagrammatically as shown on this slide.   17 00:01:29,200 --> 00:01:34,160 On the top we have multiple threads of computation  or multiple cores, and at the bottom we have the   18 00:01:34,160 --> 00:01:40,000 memory which is shared amongst all threads. Each  thread additionally has a local buffer known   19 00:01:40,000 --> 00:01:45,760 as a store buffer which is only accessible  to the thread. Every time a thread issues a   20 00:01:45,760 --> 00:01:50,080 write this will be added to its own local buffer  and it's not immediately propagated to memory.   21 00:01:51,040 --> 00:01:54,800 The writes in the local buffer are only  visible to the thread and at some future   22 00:01:54,800 --> 00:01:59,120 point they will be non-deterministically  propagated to memory in FIFO order,   23 00:01:59,120 --> 00:02:04,480 by which point they become visible to other  threads. As a result, every time a thread issues   24 00:02:04,480 --> 00:02:08,240 a read, it will first check its own local  buffer to see if there is an entry there;   25 00:02:08,240 --> 00:02:13,360 otherwise, it will go and fetch it from memory.  Of course since the local buffers and the memory   26 00:02:13,360 --> 00:02:17,680 are volatile, if there is a crash, then the  contents of memory and buffers will be lost.   27 00:02:19,360 --> 00:02:25,440 You can think of the Px86 model as an extension  of the x86 diagram with an additional layer   28 00:02:25,440 --> 00:02:31,760 known as the persistency buffer or the pbuffer. As  before, every time a thread issues a write it will   29 00:02:31,760 --> 00:02:38,080 be added to its own local buffer. This time, the  writes in the local buffer will be propagated but   30 00:02:38,080 --> 00:02:41,600 instead of memory they will be going  to the *pbuffer* in the FIFO order.   31 00:02:42,640 --> 00:02:48,400 The writes in the pbuffer are now visible to all  other threads but they are still volatile. The   32 00:02:48,400 --> 00:02:53,600 writes in the pbuffer in turn will be propagated  to the memory at non-deterministic points in time,   33 00:02:53,600 --> 00:02:59,360 by which point they remain visible to all  threads and they also become persistent. This   34 00:02:59,360 --> 00:03:04,880 propagation from pbuffer to persistent memory is  no longer necessarily in the original FIFO order.   35 00:03:05,440 --> 00:03:10,400 It is in fact in the FIFO per-location  order, meaning that the writes on each   36 00:03:10,400 --> 00:03:13,920 location are propagated in the same  order that they entered the pbuffer,   37 00:03:13,920 --> 00:03:18,240 but the writes on different locations may be  propagated to memory in an arbitrary order. 38 00:03:20,880 --> 00:03:26,000 Similarly, every time a thread issues a read  it will traverse this hierarchy as before:   39 00:03:26,000 --> 00:03:28,160 first it will check the local  buffer to see if there is an   40 00:03:28,160 --> 00:03:31,760 entry there if so it will fetch it;  otherwise it will go to the pbuffer,   41 00:03:31,760 --> 00:03:35,040 if there is an entry there it will return it;  otherwise it will go all the way to memory.   42 00:03:36,480 --> 00:03:40,960 Once again, if there is a crash, since the buffer  and the pbuffer are volatile they will be lost,   43 00:03:40,960 --> 00:03:45,840 but the contents of the memory is now persistent  and will remain visible after a crash.   44 00:03:47,680 --> 00:03:52,960 Of course, it was important to understand and  formalize the persistency semantics of Intel-x86   45 00:03:52,960 --> 00:03:58,480 as we did in our POPL 2020 work. But when  it comes to reasoning about the properties   46 00:03:58,480 --> 00:04:04,160 of persistent programs written on top of  Px86, this is far from ideal. Specifically,   47 00:04:04,160 --> 00:04:08,560 this model is very low-level and any reasoning  at this level involves reasoning at the level   48 00:04:08,560 --> 00:04:14,320 of program *executions* or *traces* which is not  ideal and makes it difficult to verify high-level   49 00:04:14,320 --> 00:04:19,840 program invariants. So ideally what we want is a  high-level reasoning mechanism such as a program   50 00:04:19,840 --> 00:04:24,640 logic that allows us to reason at the level  of program *syntax*, and not program *traces*,   51 00:04:25,440 --> 00:04:30,880 which will make it simpler to verify high level  invariants. So this is the gap that we were trying   52 00:04:30,880 --> 00:04:34,800 to fill at the beginning of this work and when we  turned to the state of the art we realised that   53 00:04:34,800 --> 00:04:42,480 there is indeed no program logic for persistency  in the literature. We decided to develop and   54 00:04:42,480 --> 00:04:47,360 devise the first program logic for persistency and  for that we had two possible avenues available.   55 00:04:48,400 --> 00:04:53,120 First, we could do a *high-level* extension and by  that I mean that we could take an existing program   56 00:04:53,120 --> 00:04:59,840 logic that is sound for x86 and then extend that  to support the persistency features of Px86.   57 00:05:01,360 --> 00:05:05,600 Alternatively, we could leave our  program logic that is sound for x86   58 00:05:06,160 --> 00:05:11,680 untouched, and instead change the *low-level*  semantics, and by that I mean we could encode   59 00:05:11,680 --> 00:05:18,480 Px86 into x86 and use the same logic as  before to reason about Px86 programs.   60 00:05:19,680 --> 00:05:23,360 It turns out that for us the second  approach was simpler and this is what we   61 00:05:23,360 --> 00:05:30,640 did in this paper. More specifically, our  contributions are as follows. First and foremost,   62 00:05:30,640 --> 00:05:37,760 we developed an instrumented x86 (Ix86) semantics  which reduces Px86 and encodes Px86 to x86 by   63 00:05:37,760 --> 00:05:43,520 removing the pbuffer component. Then we provided  a translation mechanism that takes a Px86 program   64 00:05:43,520 --> 00:05:49,920 and gives us a program on top of Ix86. As I will  discuss shortly this involves several challenges.   65 00:05:50,880 --> 00:05:55,520 We then built our program logic, persistent  Owick-Gries reasoning or POG for short,   66 00:05:55,520 --> 00:05:59,840 on top of Ix86, and verified several  representative examples in the paper.   67 00:06:00,880 --> 00:06:05,840 For the rest of this talk, I will focus  on our instrumented x86 semantics or Ix86.   68 00:06:06,800 --> 00:06:12,320 As I said this involves several challenges  that I will show you through examples. First,   69 00:06:12,320 --> 00:06:18,400 let's look at a simple program where I assign 1  to x and 1 to y, and then the program crashes.   70 00:06:19,360 --> 00:06:24,800 Now if originally both x and y hold 0 in memory,  the question is what are the possible observable   71 00:06:24,800 --> 00:06:30,720 behaviours after the crash. In fact it turns  out that there are four possible behaviours as   72 00:06:30,720 --> 00:06:35,360 shown here, in the sense that if both writes were  propagated to memory before the crash then I will   73 00:06:35,360 --> 00:06:40,720 see 1 for both; or if neither did then I would  see 0 for both; or if the first one propagated to   74 00:06:40,720 --> 00:06:46,320 memory and the second one didn't then I will see x  is 1, y is 0. But most surprisingly, it's possible   75 00:06:46,320 --> 00:06:50,960 for me to see the second write having propagated  but not the first one. In other words, I can   76 00:06:50,960 --> 00:06:56,640 see these persists out of order. Intuitively,  the reason for that is because as we said the   77 00:06:56,640 --> 00:07:01,840 writes in the pbuffer on different locations can  be propagated to memory in an arbitrary order.   78 00:07:02,400 --> 00:07:07,360 In particular, let's look at this execution.  Here I have only one thread of execution and I   79 00:07:07,360 --> 00:07:12,400 have greyed out the other threads. So the thread  issues a write to x which like we said will be   80 00:07:12,400 --> 00:07:17,120 added to its own buffer and then it issues a write  to y; again it will be added to its own buffer.   81 00:07:17,680 --> 00:07:22,320 These writes are propagated to the pbuffer in  the FIFO order (in the same order that they were   82 00:07:22,320 --> 00:07:29,360 in the local buffer). Now these writes in the  pbuffer will start propagating to memory. But   83 00:07:29,360 --> 00:07:33,120 like we said the writes on different locations  don't necessarily have to reach memory in the   84 00:07:33,120 --> 00:07:38,560 same order. So here it's possible for the  write on y to be propagated to memory first.   85 00:07:39,120 --> 00:07:43,920 Now if a crash occurs at this point then the  contents of the pbuffer will be lost and when   86 00:07:43,920 --> 00:07:49,440 I restart the machine I will see the contents  of memory as x is 0 and y is 1, as shown in this   87 00:07:49,440 --> 00:07:58,720 behaviour. Remember that by encoding Px86 into x86  we essentially want to remove this pbuffer. So the   88 00:07:58,720 --> 00:08:03,280 question is how do we model this out-of-order  propagation in the absence of the pbuffer?   89 00:08:05,280 --> 00:08:10,320 The way we dealt with this was by  introducing two versions per location.   90 00:08:11,360 --> 00:08:16,480 Specifically, for each location x, we record  a volatile version xv and a persistent version   91 00:08:16,480 --> 00:08:23,440 xp. Our translation mechanism says that every time  you write to x write to xv instead, and every time   92 00:08:23,440 --> 00:08:30,400 you read from x read from xv instead. To model the  propagation from pbuffer to memory, we can say at   93 00:08:30,400 --> 00:08:37,600 non-deterministic points in time copy the contents  of xv to xp. If I go back to my example, on the   94 00:08:37,600 --> 00:08:42,080 left-hand side here I have the original program  and on the right-hand side I have its translation.   95 00:08:43,600 --> 00:08:47,840 I start with the state where everything is 0:  both versions of both locations are initially   96 00:08:47,840 --> 00:08:54,320 0. Then I assign 1 to xv so now I have xv is 1.  Now I also have to account for the unbuffering,   97 00:08:54,320 --> 00:09:00,240 for the propagation of the volatile versions  being copied to persistent versions.   98 00:09:00,240 --> 00:09:05,200 Here it is possible for me to copy xv to  xp so I can have xp as either 0 for the   99 00:09:05,200 --> 00:09:09,760 case that the propagation hasn't happened, or 1  for the case that the propagation has happened.   100 00:09:11,520 --> 00:09:17,280 Similarly, I write 1 to yv; yv is 1 and once  again I account for unbuffering and yp now can   101 00:09:17,280 --> 00:09:23,520 be either 0 or 1. Taken together, these two will  give me the four possible outcomes that I wanted. 102 00:09:25,760 --> 00:09:31,040 But there is another challenge. In fact, in  order to avoid such crazy out-of-order persists,   103 00:09:31,040 --> 00:09:35,840 Intel allows you to use something we  call explicit persist instructions,   104 00:09:36,720 --> 00:09:41,840 here written as flush x. In this example,  if I insert a flush x between these two   105 00:09:42,800 --> 00:09:48,160 writes, this says that make sure all  earlier writes in program order on x   106 00:09:48,880 --> 00:09:54,480 persist before all later writes. Here, this  insertion of flush x makes sure the write on x   107 00:09:54,480 --> 00:09:59,600 persists before the write on y, and as a result  I don't have this out-of-order persist behaviour   108 00:09:59,600 --> 00:10:04,800 anymore. In other words, this flush x imposes  additional ordering constraints in the pbuffer.   109 00:10:06,240 --> 00:10:10,160 Note that flush x behaves *asynchronously*  in the sense that when it's executed   110 00:10:10,160 --> 00:10:14,720 it doesn't necessarily make sure that x is  persisted, and as a result it's still possible   111 00:10:14,720 --> 00:10:22,080 to see x is 0. All flush x does is add ordering  constraints in the pbuffer. Once again, since we   112 00:10:22,080 --> 00:10:28,000 want to remove the pbuffer, the question is how do  we model this without the pbuffer on top of x86.   113 00:10:29,600 --> 00:10:33,920 As a naive attempt, we kept our mechanism  and representation as before and we said   114 00:10:34,560 --> 00:10:41,280 as translation of flush x, every time you execute  it copy xv to xp. Let's see what goes wrong here.   115 00:10:42,160 --> 00:10:46,480 Once again I have my program on the left-hand side  and its translation on the right-hand side and I   116 00:10:46,480 --> 00:10:52,880 start with 0 for both versions of both locations.  As before, I write 1 to xv and account for   117 00:10:52,880 --> 00:10:58,560 unbuffering. Now when I execute the flush, I will  be assigning xv to xp, meaning that now xp will   118 00:10:58,560 --> 00:11:04,960 be 1. Once again, I assign 1 to yv and account  for unbuffering, which means that yp could be   119 00:11:04,960 --> 00:11:12,000 either 0 or 1. Now since xp is 1, taken together  this will give me two outcomes: either they're   120 00:11:12,000 --> 00:11:18,480 both 1 or xp is 1 and yp is 0. In other words,  we cannot model the behaviour where xp is 0.   121 00:11:18,480 --> 00:11:23,440 This is because the translation is modelling the  behaviour of flush instructions *synchronously*   122 00:11:23,440 --> 00:11:29,200 which we know is incorrect. So we went back  to the drawing board and instead extended our   123 00:11:29,200 --> 00:11:34,400 representation with an additional version for  each location. So now we have *three* versions   124 00:11:34,400 --> 00:11:40,000 with the new version, xs (the synchronous version)  used purely for modelling flush instructions.   125 00:11:40,960 --> 00:11:46,000 Similarly, we adjusted our translation. We  kept the write and read translation as before.   126 00:11:46,000 --> 00:11:51,920 Now every time we execute a flush x we copy  xv into this intermediate new version, xs.   127 00:11:53,200 --> 00:11:56,400 To model the unbuffering -- remember  we have two types of unbuffering.   128 00:11:56,400 --> 00:12:01,600 To model the unbuffering from buffer to  persistent buffer (pbuffer) we copy xv to xs.   129 00:12:02,240 --> 00:12:07,200 To model the unbuffering from pbuffer to memory  we copy xs (the intermediate version) to xp.   130 00:12:08,240 --> 00:12:12,800 For technical reasons, this propagation (this  unbuffering) has to be done for *all locations at   131 00:12:12,800 --> 00:12:18,240 once*, hence the overline notation. Intuitively,  this is to make sure we capture *all* ordering   132 00:12:18,240 --> 00:12:24,160 constraints imposed by flush instructions. Going  back to our example, now we have three versions;   133 00:12:24,160 --> 00:12:29,360 so let's assume that all three versions are set  to 0 initially. I assign 1 to xv and account for   134 00:12:29,360 --> 00:12:35,600 unbuffering and this time the unbuffering involves  copying from xv to xs and from xs to xp. So xs and   135 00:12:35,600 --> 00:12:42,160 xv could both be either 0 or 1. Next, when I  execute flush x, I will be copying xv to xs,   136 00:12:42,160 --> 00:12:50,800 which means that now I have xs=1. But xp is still  0 or 1. Next, I assign 1 to yv and yv becomes 1.   137 00:12:50,800 --> 00:12:56,560 Now I'll have to account for unbuffering. When  I account for unbuffering it means that I have   138 00:12:56,560 --> 00:13:04,000 to assume that yv can be copied to ys, which  means that ys could be 0 or 1. But I also have to   139 00:13:04,000 --> 00:13:10,880 account for the case where all s-versions, xs and  ys, are copied to their corresponding p-versions.   140 00:13:11,680 --> 00:13:19,920 Since xs is 1, if this copying happens, then if as  a result yp becomes 1 then xp will have to become   141 00:13:19,920 --> 00:13:27,920 1 as well. So we get this invariant that says  if yp is 1, i.e. ys=1 has been propagated to yp,   142 00:13:28,480 --> 00:13:35,520 then xp must be 1, i.e. xs must have also  been propagated to xp. With this invariant,   143 00:13:35,520 --> 00:13:40,400 now I have exactly the behaviours that I wanted  from before, the three cases as shown here. 144 00:13:42,720 --> 00:13:48,560 Finally, there is an additional challenge in  that Intel gives us an additional explicit   145 00:13:48,560 --> 00:13:54,800 persist instruction known as flush-optimised.  Its behaviour is more or less like flush:   146 00:13:54,800 --> 00:13:59,760 it behaves asynchronously. But it has weaker  ordering constraints in the sense that it   147 00:13:59,760 --> 00:14:03,120 itself can be reordered with respect to  other instructions. So in this program,   148 00:14:03,120 --> 00:14:08,320 it is possible for me to order the flush-optimised  after the write and as a result once again it is   149 00:14:08,320 --> 00:14:13,440 possible to observe this weak behaviour. Now  we could have gone back to the drawing board   150 00:14:13,440 --> 00:14:17,680 and once again extended our representation  of these versions and our translation. But   151 00:14:17,680 --> 00:14:24,000 we found that this was rather difficult: encoding  flush-optimised in this representation was hard.   152 00:14:24,000 --> 00:14:28,960 Instead, what we decided to do was rewrite  programs so that instead of using flush-optimised   153 00:14:28,960 --> 00:14:32,640 they use flushes, and this rewriting will  have to preserve the program behaviour,   154 00:14:32,640 --> 00:14:35,040 i.e. we will have to rewrite it  into an *equivalent* program.   155 00:14:36,240 --> 00:14:41,840 Specifically, when we inspected the code bases  that use flush-optimised, we realised that   156 00:14:43,040 --> 00:14:48,240 this is often done in a particular way, following  a particular programming pattern known as epoch   157 00:14:48,240 --> 00:14:54,720 persistency. In such cases, we realised that it  is always possible to transform the program and   158 00:14:54,720 --> 00:15:00,560 rewrite it so that instead of flush-optimised it  uses flushes. So we developed this transformation   159 00:15:00,560 --> 00:15:05,520 mechanism and showed that it indeed preserves the  meaning of the program so that this transformation   160 00:15:05,520 --> 00:15:09,360 yields an equivalent program behaviour  when we replace flush-optimised with flush. 161 00:15:11,920 --> 00:15:17,760 This brings me to the end of my talk. I  showed you an overview of our instrumented   162 00:15:17,760 --> 00:15:24,320 (Ix86) semantics and how we reduce Px86 to x86  by removing the pbuffer. I very quickly told you   163 00:15:24,320 --> 00:15:29,200 about our transformation mechanism for rewriting  programs with flush-optimised to ones using flush.   164 00:15:29,200 --> 00:15:33,120 What I didn't show you was our program logic,  persistent Owicki-Gries Reasoning (POG),   165 00:15:33,120 --> 00:15:38,240 which is built on top of Ix86. In a nutshell  you can think of our pipeline as follows:   166 00:15:38,800 --> 00:15:43,440 take a Px86 program which could potentially  use flush-optimised instructions,   167 00:15:43,440 --> 00:15:48,000 feed it to our transformation mechanism and  obtain another Px86 program that doesn't have   168 00:15:48,000 --> 00:15:54,320 flush-optimised instructions. Then run it through  our translation to convert it to an Ix86 program.   169 00:15:54,960 --> 00:15:58,240 Finally, feed this program  to our program logic, POG,   170 00:15:58,240 --> 00:16:05,440 to obtain a *verified* Ix86 program. That's all  I have to say. Thank you very much for listening!