1
00:00:01,760 --> 00:00:05,760
Hello and thank you for joining me virtually 
today! I'm Azalea Raad and I will be talking to  

2
00:00:05,760 --> 00:00:10,320
you about our joint work with Ori Lahav and Viktor 
Vafeiadis on persistent Owicki-Gries reasoning.  

3
00:00:11,360 --> 00:00:16,320
Let's start with a quick introduction to what 
persistency is. Traditionally, computer storage  

4
00:00:16,320 --> 00:00:21,120
is divided into two categories. On the one hand, 
we have the RAM which is fast to access, to read  

5
00:00:21,120 --> 00:00:26,000
and write, but it is volatile in the sense that 
it loses its contents after a power failure or a  

6
00:00:26,000 --> 00:00:31,040
crash. On the other hand, we have hard disk drives 
which are slow to access but they're persistent  

7
00:00:31,040 --> 00:00:37,360
and preserve their contents across power failures 
and crashes. Non-volatile memory or NVM for short  

8
00:00:37,360 --> 00:00:42,160
is a hybrid of these two worlds, bringing 
you the best of both. NVM is fast like RAM  

9
00:00:42,160 --> 00:00:48,080
but persistent like hard disk drives. In fact, 
NVM or persistent memory has become so prevalent  

10
00:00:48,080 --> 00:00:52,880
that recently mainstream architectures such as 
Intel and ARM have extended their models with  

11
00:00:52,880 --> 00:00:59,360
persistency features. In our recent work in POPL 
2020, we formalised the persistency guarantees  

12
00:00:59,360 --> 00:01:05,680
of the Intel-x86 architecture and developed the 
model called Px86 for persistent x86, and the  

13
00:01:05,680 --> 00:01:11,840
way we did this was by extending the original 
x86 (TSO) model with persistency features.  

14
00:01:13,120 --> 00:01:17,680
It is important for this talk for us to have 
an intuitive understanding of what Px86 is  

15
00:01:17,680 --> 00:01:24,240
which itself requires an understanding of the 
original x86 model. As some of you would know,  

16
00:01:24,240 --> 00:01:28,240
we can think of the x86 architecture 
diagrammatically as shown on this slide.  

17
00:01:29,200 --> 00:01:34,160
On the top we have multiple threads of computation 
or multiple cores, and at the bottom we have the  

18
00:01:34,160 --> 00:01:40,000
memory which is shared amongst all threads. Each 
thread additionally has a local buffer known  

19
00:01:40,000 --> 00:01:45,760
as a store buffer which is only accessible 
to the thread. Every time a thread issues a  

20
00:01:45,760 --> 00:01:50,080
write this will be added to its own local buffer 
and it's not immediately propagated to memory.  

21
00:01:51,040 --> 00:01:54,800
The writes in the local buffer are only 
visible to the thread and at some future  

22
00:01:54,800 --> 00:01:59,120
point they will be non-deterministically 
propagated to memory in FIFO order,  

23
00:01:59,120 --> 00:02:04,480
by which point they become visible to other 
threads. As a result, every time a thread issues  

24
00:02:04,480 --> 00:02:08,240
a read, it will first check its own local 
buffer to see if there is an entry there;  

25
00:02:08,240 --> 00:02:13,360
otherwise, it will go and fetch it from memory. 
Of course since the local buffers and the memory  

26
00:02:13,360 --> 00:02:17,680
are volatile, if there is a crash, then the 
contents of memory and buffers will be lost.  

27
00:02:19,360 --> 00:02:25,440
You can think of the Px86 model as an extension 
of the x86 diagram with an additional layer  

28
00:02:25,440 --> 00:02:31,760
known as the persistency buffer or the pbuffer. As 
before, every time a thread issues a write it will  

29
00:02:31,760 --> 00:02:38,080
be added to its own local buffer. This time, the 
writes in the local buffer will be propagated but  

30
00:02:38,080 --> 00:02:41,600
instead of memory they will be going 
to the *pbuffer* in the FIFO order.  

31
00:02:42,640 --> 00:02:48,400
The writes in the pbuffer are now visible to all 
other threads but they are still volatile. The  

32
00:02:48,400 --> 00:02:53,600
writes in the pbuffer in turn will be propagated 
to the memory at non-deterministic points in time,  

33
00:02:53,600 --> 00:02:59,360
by which point they remain visible to all 
threads and they also become persistent. This  

34
00:02:59,360 --> 00:03:04,880
propagation from pbuffer to persistent memory is 
no longer necessarily in the original FIFO order.  

35
00:03:05,440 --> 00:03:10,400
It is in fact in the FIFO per-location 
order, meaning that the writes on each  

36
00:03:10,400 --> 00:03:13,920
location are propagated in the same 
order that they entered the pbuffer,  

37
00:03:13,920 --> 00:03:18,240
but the writes on different locations may be 
propagated to memory in an arbitrary order.

38
00:03:20,880 --> 00:03:26,000
Similarly, every time a thread issues a read 
it will traverse this hierarchy as before:  

39
00:03:26,000 --> 00:03:28,160
first it will check the local 
buffer to see if there is an  

40
00:03:28,160 --> 00:03:31,760
entry there if so it will fetch it; 
otherwise it will go to the pbuffer,  

41
00:03:31,760 --> 00:03:35,040
if there is an entry there it will return it; 
otherwise it will go all the way to memory.  

42
00:03:36,480 --> 00:03:40,960
Once again, if there is a crash, since the buffer 
and the pbuffer are volatile they will be lost,  

43
00:03:40,960 --> 00:03:45,840
but the contents of the memory is now persistent 
and will remain visible after a crash.  

44
00:03:47,680 --> 00:03:52,960
Of course, it was important to understand and 
formalize the persistency semantics of Intel-x86  

45
00:03:52,960 --> 00:03:58,480
as we did in our POPL 2020 work. But when 
it comes to reasoning about the properties  

46
00:03:58,480 --> 00:04:04,160
of persistent programs written on top of 
Px86, this is far from ideal. Specifically,  

47
00:04:04,160 --> 00:04:08,560
this model is very low-level and any reasoning 
at this level involves reasoning at the level  

48
00:04:08,560 --> 00:04:14,320
of program *executions* or *traces* which is not 
ideal and makes it difficult to verify high-level  

49
00:04:14,320 --> 00:04:19,840
program invariants. So ideally what we want is a 
high-level reasoning mechanism such as a program  

50
00:04:19,840 --> 00:04:24,640
logic that allows us to reason at the level 
of program *syntax*, and not program *traces*,  

51
00:04:25,440 --> 00:04:30,880
which will make it simpler to verify high level 
invariants. So this is the gap that we were trying  

52
00:04:30,880 --> 00:04:34,800
to fill at the beginning of this work and when we 
turned to the state of the art we realised that  

53
00:04:34,800 --> 00:04:42,480
there is indeed no program logic for persistency 
in the literature. We decided to develop and  

54
00:04:42,480 --> 00:04:47,360
devise the first program logic for persistency and 
for that we had two possible avenues available.  

55
00:04:48,400 --> 00:04:53,120
First, we could do a *high-level* extension and by 
that I mean that we could take an existing program  

56
00:04:53,120 --> 00:04:59,840
logic that is sound for x86 and then extend that 
to support the persistency features of Px86.  

57
00:05:01,360 --> 00:05:05,600
Alternatively, we could leave our 
program logic that is sound for x86  

58
00:05:06,160 --> 00:05:11,680
untouched, and instead change the *low-level* 
semantics, and by that I mean we could encode  

59
00:05:11,680 --> 00:05:18,480
Px86 into x86 and use the same logic as 
before to reason about Px86 programs.  

60
00:05:19,680 --> 00:05:23,360
It turns out that for us the second 
approach was simpler and this is what we  

61
00:05:23,360 --> 00:05:30,640
did in this paper. More specifically, our 
contributions are as follows. First and foremost,  

62
00:05:30,640 --> 00:05:37,760
we developed an instrumented x86 (Ix86) semantics 
which reduces Px86 and encodes Px86 to x86 by  

63
00:05:37,760 --> 00:05:43,520
removing the pbuffer component. Then we provided 
a translation mechanism that takes a Px86 program  

64
00:05:43,520 --> 00:05:49,920
and gives us a program on top of Ix86. As I will 
discuss shortly this involves several challenges.  

65
00:05:50,880 --> 00:05:55,520
We then built our program logic, persistent 
Owick-Gries reasoning or POG for short,  

66
00:05:55,520 --> 00:05:59,840
on top of Ix86, and verified several 
representative examples in the paper.  

67
00:06:00,880 --> 00:06:05,840
For the rest of this talk, I will focus 
on our instrumented x86 semantics or Ix86.  

68
00:06:06,800 --> 00:06:12,320
As I said this involves several challenges 
that I will show you through examples. First,  

69
00:06:12,320 --> 00:06:18,400
let's look at a simple program where I assign 1 
to x and 1 to y, and then the program crashes.  

70
00:06:19,360 --> 00:06:24,800
Now if originally both x and y hold 0 in memory, 
the question is what are the possible observable  

71
00:06:24,800 --> 00:06:30,720
behaviours after the crash. In fact it turns 
out that there are four possible behaviours as  

72
00:06:30,720 --> 00:06:35,360
shown here, in the sense that if both writes were 
propagated to memory before the crash then I will  

73
00:06:35,360 --> 00:06:40,720
see 1 for both; or if neither did then I would 
see 0 for both; or if the first one propagated to  

74
00:06:40,720 --> 00:06:46,320
memory and the second one didn't then I will see x 
is 1, y is 0. But most surprisingly, it's possible  

75
00:06:46,320 --> 00:06:50,960
for me to see the second write having propagated 
but not the first one. In other words, I can  

76
00:06:50,960 --> 00:06:56,640
see these persists out of order. Intuitively, 
the reason for that is because as we said the  

77
00:06:56,640 --> 00:07:01,840
writes in the pbuffer on different locations can 
be propagated to memory in an arbitrary order.  

78
00:07:02,400 --> 00:07:07,360
In particular, let's look at this execution. 
Here I have only one thread of execution and I  

79
00:07:07,360 --> 00:07:12,400
have greyed out the other threads. So the thread 
issues a write to x which like we said will be  

80
00:07:12,400 --> 00:07:17,120
added to its own buffer and then it issues a write 
to y; again it will be added to its own buffer.  

81
00:07:17,680 --> 00:07:22,320
These writes are propagated to the pbuffer in 
the FIFO order (in the same order that they were  

82
00:07:22,320 --> 00:07:29,360
in the local buffer). Now these writes in the 
pbuffer will start propagating to memory. But  

83
00:07:29,360 --> 00:07:33,120
like we said the writes on different locations 
don't necessarily have to reach memory in the  

84
00:07:33,120 --> 00:07:38,560
same order. So here it's possible for the 
write on y to be propagated to memory first.  

85
00:07:39,120 --> 00:07:43,920
Now if a crash occurs at this point then the 
contents of the pbuffer will be lost and when  

86
00:07:43,920 --> 00:07:49,440
I restart the machine I will see the contents 
of memory as x is 0 and y is 1, as shown in this  

87
00:07:49,440 --> 00:07:58,720
behaviour. Remember that by encoding Px86 into x86 
we essentially want to remove this pbuffer. So the  

88
00:07:58,720 --> 00:08:03,280
question is how do we model this out-of-order 
propagation in the absence of the pbuffer?  

89
00:08:05,280 --> 00:08:10,320
The way we dealt with this was by 
introducing two versions per location.  

90
00:08:11,360 --> 00:08:16,480
Specifically, for each location x, we record 
a volatile version xv and a persistent version  

91
00:08:16,480 --> 00:08:23,440
xp. Our translation mechanism says that every time 
you write to x write to xv instead, and every time  

92
00:08:23,440 --> 00:08:30,400
you read from x read from xv instead. To model the 
propagation from pbuffer to memory, we can say at  

93
00:08:30,400 --> 00:08:37,600
non-deterministic points in time copy the contents 
of xv to xp. If I go back to my example, on the  

94
00:08:37,600 --> 00:08:42,080
left-hand side here I have the original program 
and on the right-hand side I have its translation.  

95
00:08:43,600 --> 00:08:47,840
I start with the state where everything is 0: 
both versions of both locations are initially  

96
00:08:47,840 --> 00:08:54,320
0. Then I assign 1 to xv so now I have xv is 1. 
Now I also have to account for the unbuffering,  

97
00:08:54,320 --> 00:09:00,240
for the propagation of the volatile versions 
being copied to persistent versions.  

98
00:09:00,240 --> 00:09:05,200
Here it is possible for me to copy xv to 
xp so I can have xp as either 0 for the  

99
00:09:05,200 --> 00:09:09,760
case that the propagation hasn't happened, or 1 
for the case that the propagation has happened.  

100
00:09:11,520 --> 00:09:17,280
Similarly, I write 1 to yv; yv is 1 and once 
again I account for unbuffering and yp now can  

101
00:09:17,280 --> 00:09:23,520
be either 0 or 1. Taken together, these two will 
give me the four possible outcomes that I wanted.

102
00:09:25,760 --> 00:09:31,040
But there is another challenge. In fact, in 
order to avoid such crazy out-of-order persists,  

103
00:09:31,040 --> 00:09:35,840
Intel allows you to use something we 
call explicit persist instructions,  

104
00:09:36,720 --> 00:09:41,840
here written as flush x. In this example, 
if I insert a flush x between these two  

105
00:09:42,800 --> 00:09:48,160
writes, this says that make sure all 
earlier writes in program order on x  

106
00:09:48,880 --> 00:09:54,480
persist before all later writes. Here, this 
insertion of flush x makes sure the write on x  

107
00:09:54,480 --> 00:09:59,600
persists before the write on y, and as a result 
I don't have this out-of-order persist behaviour  

108
00:09:59,600 --> 00:10:04,800
anymore. In other words, this flush x imposes 
additional ordering constraints in the pbuffer.  

109
00:10:06,240 --> 00:10:10,160
Note that flush x behaves *asynchronously* 
in the sense that when it's executed  

110
00:10:10,160 --> 00:10:14,720
it doesn't necessarily make sure that x is 
persisted, and as a result it's still possible  

111
00:10:14,720 --> 00:10:22,080
to see x is 0. All flush x does is add ordering 
constraints in the pbuffer. Once again, since we  

112
00:10:22,080 --> 00:10:28,000
want to remove the pbuffer, the question is how do 
we model this without the pbuffer on top of x86.  

113
00:10:29,600 --> 00:10:33,920
As a naive attempt, we kept our mechanism 
and representation as before and we said  

114
00:10:34,560 --> 00:10:41,280
as translation of flush x, every time you execute 
it copy xv to xp. Let's see what goes wrong here.  

115
00:10:42,160 --> 00:10:46,480
Once again I have my program on the left-hand side 
and its translation on the right-hand side and I  

116
00:10:46,480 --> 00:10:52,880
start with 0 for both versions of both locations. 
As before, I write 1 to xv and account for  

117
00:10:52,880 --> 00:10:58,560
unbuffering. Now when I execute the flush, I will 
be assigning xv to xp, meaning that now xp will  

118
00:10:58,560 --> 00:11:04,960
be 1. Once again, I assign 1 to yv and account 
for unbuffering, which means that yp could be  

119
00:11:04,960 --> 00:11:12,000
either 0 or 1. Now since xp is 1, taken together 
this will give me two outcomes: either they're  

120
00:11:12,000 --> 00:11:18,480
both 1 or xp is 1 and yp is 0. In other words, 
we cannot model the behaviour where xp is 0.  

121
00:11:18,480 --> 00:11:23,440
This is because the translation is modelling the 
behaviour of flush instructions *synchronously*  

122
00:11:23,440 --> 00:11:29,200
which we know is incorrect. So we went back 
to the drawing board and instead extended our  

123
00:11:29,200 --> 00:11:34,400
representation with an additional version for 
each location. So now we have *three* versions  

124
00:11:34,400 --> 00:11:40,000
with the new version, xs (the synchronous version) 
used purely for modelling flush instructions.  

125
00:11:40,960 --> 00:11:46,000
Similarly, we adjusted our translation. We 
kept the write and read translation as before.  

126
00:11:46,000 --> 00:11:51,920
Now every time we execute a flush x we copy 
xv into this intermediate new version, xs.  

127
00:11:53,200 --> 00:11:56,400
To model the unbuffering -- remember 
we have two types of unbuffering.  

128
00:11:56,400 --> 00:12:01,600
To model the unbuffering from buffer to 
persistent buffer (pbuffer) we copy xv to xs.  

129
00:12:02,240 --> 00:12:07,200
To model the unbuffering from pbuffer to memory 
we copy xs (the intermediate version) to xp.  

130
00:12:08,240 --> 00:12:12,800
For technical reasons, this propagation (this 
unbuffering) has to be done for *all locations at  

131
00:12:12,800 --> 00:12:18,240
once*, hence the overline notation. Intuitively, 
this is to make sure we capture *all* ordering  

132
00:12:18,240 --> 00:12:24,160
constraints imposed by flush instructions. Going 
back to our example, now we have three versions;  

133
00:12:24,160 --> 00:12:29,360
so let's assume that all three versions are set 
to 0 initially. I assign 1 to xv and account for  

134
00:12:29,360 --> 00:12:35,600
unbuffering and this time the unbuffering involves 
copying from xv to xs and from xs to xp. So xs and  

135
00:12:35,600 --> 00:12:42,160
xv could both be either 0 or 1. Next, when I 
execute flush x, I will be copying xv to xs,  

136
00:12:42,160 --> 00:12:50,800
which means that now I have xs=1. But xp is still 
0 or 1. Next, I assign 1 to yv and yv becomes 1.  

137
00:12:50,800 --> 00:12:56,560
Now I'll have to account for unbuffering. When 
I account for unbuffering it means that I have  

138
00:12:56,560 --> 00:13:04,000
to assume that yv can be copied to ys, which 
means that ys could be 0 or 1. But I also have to  

139
00:13:04,000 --> 00:13:10,880
account for the case where all s-versions, xs and 
ys, are copied to their corresponding p-versions.  

140
00:13:11,680 --> 00:13:19,920
Since xs is 1, if this copying happens, then if as 
a result yp becomes 1 then xp will have to become  

141
00:13:19,920 --> 00:13:27,920
1 as well. So we get this invariant that says 
if yp is 1, i.e. ys=1 has been propagated to yp,  

142
00:13:28,480 --> 00:13:35,520
then xp must be 1, i.e. xs must have also 
been propagated to xp. With this invariant,  

143
00:13:35,520 --> 00:13:40,400
now I have exactly the behaviours that I wanted 
from before, the three cases as shown here.

144
00:13:42,720 --> 00:13:48,560
Finally, there is an additional challenge in 
that Intel gives us an additional explicit  

145
00:13:48,560 --> 00:13:54,800
persist instruction known as flush-optimised. 
Its behaviour is more or less like flush:  

146
00:13:54,800 --> 00:13:59,760
it behaves asynchronously. But it has weaker 
ordering constraints in the sense that it  

147
00:13:59,760 --> 00:14:03,120
itself can be reordered with respect to 
other instructions. So in this program,  

148
00:14:03,120 --> 00:14:08,320
it is possible for me to order the flush-optimised 
after the write and as a result once again it is  

149
00:14:08,320 --> 00:14:13,440
possible to observe this weak behaviour. Now 
we could have gone back to the drawing board  

150
00:14:13,440 --> 00:14:17,680
and once again extended our representation 
of these versions and our translation. But  

151
00:14:17,680 --> 00:14:24,000
we found that this was rather difficult: encoding 
flush-optimised in this representation was hard.  

152
00:14:24,000 --> 00:14:28,960
Instead, what we decided to do was rewrite 
programs so that instead of using flush-optimised  

153
00:14:28,960 --> 00:14:32,640
they use flushes, and this rewriting will 
have to preserve the program behaviour,  

154
00:14:32,640 --> 00:14:35,040
i.e. we will have to rewrite it 
into an *equivalent* program.  

155
00:14:36,240 --> 00:14:41,840
Specifically, when we inspected the code bases 
that use flush-optimised, we realised that  

156
00:14:43,040 --> 00:14:48,240
this is often done in a particular way, following 
a particular programming pattern known as epoch  

157
00:14:48,240 --> 00:14:54,720
persistency. In such cases, we realised that it 
is always possible to transform the program and  

158
00:14:54,720 --> 00:15:00,560
rewrite it so that instead of flush-optimised it 
uses flushes. So we developed this transformation  

159
00:15:00,560 --> 00:15:05,520
mechanism and showed that it indeed preserves the 
meaning of the program so that this transformation  

160
00:15:05,520 --> 00:15:09,360
yields an equivalent program behaviour 
when we replace flush-optimised with flush.

161
00:15:11,920 --> 00:15:17,760
This brings me to the end of my talk. I 
showed you an overview of our instrumented  

162
00:15:17,760 --> 00:15:24,320
(Ix86) semantics and how we reduce Px86 to x86 
by removing the pbuffer. I very quickly told you  

163
00:15:24,320 --> 00:15:29,200
about our transformation mechanism for rewriting 
programs with flush-optimised to ones using flush.  

164
00:15:29,200 --> 00:15:33,120
What I didn't show you was our program logic, 
persistent Owicki-Gries Reasoning (POG),  

165
00:15:33,120 --> 00:15:38,240
which is built on top of Ix86. In a nutshell 
you can think of our pipeline as follows:  

166
00:15:38,800 --> 00:15:43,440
take a Px86 program which could potentially 
use flush-optimised instructions,  

167
00:15:43,440 --> 00:15:48,000
feed it to our transformation mechanism and 
obtain another Px86 program that doesn't have  

168
00:15:48,000 --> 00:15:54,320
flush-optimised instructions. Then run it through 
our translation to convert it to an Ix86 program.  

169
00:15:54,960 --> 00:15:58,240
Finally, feed this program 
to our program logic, POG,  

170
00:15:58,240 --> 00:16:05,440
to obtain a *verified* Ix86 program. That's all 
I have to say. Thank you very much for listening!