The horror of horrors
As you've perhaps noticed, MPEx went down to help me celebrate my birthday. This is the story of all that.
Part I. About a month ago, we confronted a social engineering attack(i) that managed to push fake trades through the datafeeds for about one hour. The servers could have been trivially restarted at that point, but I elected to instead have the code thoroughly reviewed, further code written to prevent this possibility in the future, extensively tested and so on. This meant a few days of downtime for MPEx at the end of September.
On October the 25th, around 5 am GMT, the proxies became desynchronized. This in itself is not a particularly rare event. In this case however, measures implemented as discussed before resulted in the proxies being unable to resync, each recognising a large enough section of the others as "traitorous" to make progress impossible. As a result, any order issued by a customer had a 99.9% or better chance of being rejected (probably as high as 100% in the case of STATs).
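For the curious, the arithmetic at work is roughly the classic Byzantine quorum rule: n participants tolerate at most floor((n - 1) / 3) "traitors" before agreement becomes impossible. A minimal sketch follows; the node count and threshold are illustrative assumptions, since the actual MPEx proxy protocol is not published.

# Illustrative only: classic Byzantine-agreement arithmetic. "Traitorous"
# here simply means a proxy whose view a given peer refuses to trust;
# the real MPEx proxy protocol and its exact threshold are not public.
def can_make_progress(total_proxies, traitorous):
    # A Byzantine quorum of n nodes tolerates at most (n - 1) // 3 faulty
    # members; beyond that, the honest ones cannot agree on anything.
    return traitorous <= (total_proxies - 1) // 3

# With, say, 7 proxies, agreement survives 2 "traitors"...
assert can_make_progress(7, 2)
# ...but once each proxy distrusts 3 or more of its peers, practically
# every customer order gets rejected, much as described above.
assert not can_make_progress(7, 3)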
The desync, again, could have been more or less easily resolved. I elected however to allow the situation to run for up to 36 hours, to collect data that would allow us to better understand the problem, to see if this was the early stage of an attack that might later declare itself, and to have sufficient time to prepare and deploy a solution that would make the problem unlikely to recur. Finally, around midday on the 26th MPEx was back online, people could verify their balances were correct and so on.
Part II. About three and a half hours later, the RAID array powering the trade engine went into degraded status. The RAID is configured as 1+0 and uses, at least theoretically, upmarket Intel drives. The former meant we could have just popped out the bad drive and reconstructed it on the fly(ii). The latter however did not sit well with the CTO's native paranoia, and so we brought the entire system down to check exactly what was going on.(iii)
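To illustrate why a single dead drive in a 1+0 array is in principle a non-event, here is a small sketch of the mirrored-pair layout. The four slots match the drive map in the log excerpt further down, but the exact pairing is an assumption made for the example, not something read off the controller.

# RAID 1+0 sketch: drives form mirrored pairs, and the pairs are striped
# together. The array survives any combination of failures that leaves at
# least one healthy drive in every pair. The pairing below is assumed.
from itertools import combinations

MIRROR_PAIRS = [("slot0", "slot1"), ("slot2", "slot3")]

def array_survives(failed):
    return all(any(d not in failed for d in pair) for pair in MIRROR_PAIRS)

# Any single failure is survivable: pop the drive out, resync the mirror.
assert all(array_survives({d}) for pair in MIRROR_PAIRS for d in pair)

# Two simultaneous failures only kill the array if they hit the same pair.
drives = [d for pair in MIRROR_PAIRS for d in pair]
print([set(c) for c in combinations(drives, 2) if not array_survives(set(c))])
# -> [{'slot0', 'slot1'}, {'slot2', 'slot3'}]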
While tech people looked through the reams of nonsense available to try and discover whether in fact this was some sort of attack, somehow(iv), a new drive was being installed to replace the failed one. The system recovered on the fly, and about two hours later... the array was again degraded.
At this point we were pretty confident that noise on the power line was damaging the drives. The datacenter was able to provide readings from their outside cabinet sensor showing the power to be fine, but this failed to convince me. Consequently I will be scaling down the collaboration with this particular outfit over the coming months, notwithstanding a seven year happy history(v).
This just about brings us up to date. The trade engine has been meticulously reconstructed and is being tested. MPEx itself is scheduled to come back online October the 30th, at 6 am GMT(vi). At that point, user STATs will display correct positions as to BTC and symbols held, both in account and on the exchange book, as of October the 26th. Due to the way proxies currently work, however, the history of recent trades offered in the STAT statement, as well as the history displayed on the main site, will not list transactions that occurred starting October 18th. This has some cosmetic disadvantages, as for instance the MPEx outage will appear significantly longer than it actually was, but otherwise should not cause trouble.
There are no code changes users are required to make in order to continue using MPEx at that point(vii). As a matter of course everyone should verify their STAT and report any mismatches with previous STATs.
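For anyone who wants to automate that check, a rough sketch of the comparison follows. The STAT format itself is not reproduced here, so the holdings dictionaries are assumed to have come out of whatever parsing the reader already does; the symbol names in the example are likewise just placeholders.

# Hypothetical helper: compare two parsed STAT snapshots, each a mapping
# of symbol -> quantity held, and report every position that differs.
# Nothing here is part of MPEx itself; it is only a comparison sketch.
def stat_mismatches(old, new):
    return {s: (old.get(s, 0), new.get(s, 0))
            for s in set(old) | set(new)
            if old.get(s, 0) != new.get(s, 0)}

# Example with made-up holdings: a position that silently changed shows up.
before = {"S.MPOE": 1000, "X.EXAMPLE": 500}
after  = {"S.MPOE": 1000, "X.EXAMPLE": 499}
print(stat_mismatches(before, after))   # {'X.EXAMPLE': (500, 499)}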
We remain as always firmly committed to providing a venue for Bitcoin denominated finance, even if in the end this will require 1:1 transformers in the shape of 100kg copper blocks to protect from evil military satellites shooting microwaves into our server, or whatever the hell happened here. Sorry for the downtime!
———-
i. MPEx security breach ; MPEx breach - post mortem [↩]
ii. If we're not cheapskates content to just format the failed drive and reconstruct afterwards. [↩]
iii. If you are curious, this is how a log looks normally:
- October 1, 2013 8:38:31 AM CDT INF 423:A01C0S08L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 8, controller 1, enclosure 0, slot 0, S/N CVEM047100Y4064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 1, 2013 8:38:31 AM CDT INF 423:A01C0S09L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 9, controller 1, enclosure 0, slot 1, S/N CVEM9252012C064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 1, 2013 8:38:31 AM CDT INF 423:A01C0S010L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 10, controller 1, enclosure 0, slot 2, S/N CVEM04620027064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 1, 2013 8:38:31 AM CDT INF 423:A01C0S11L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 11, controller 1, enclosure 0, slot 3, S/N CVEM04950103064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 2, 2013 8:38:34 AM CDT INF 423:A01C0S08L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 8, controller 1, enclosure 0, slot 0, S/N CVEM047100Y4064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 2, 2013 8:38:34 AM CDT INF 423:A01C0S09L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 9, controller 1, enclosure 0, slot 1, S/N CVEM9252012C064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 2, 2013 8:38:34 AM CDT INF 423:A01C0S010L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 10, controller 1, enclosure 0, slot 2, S/N CVEM04620027064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 2, 2013 8:38:34 AM CDT INF 423:A01C0S11L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 11, controller 1, enclosure 0, slot 3, S/N CVEM04950103064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
And this is how it looks abnormally:
October 26, 2013 8:39:13 AM CDT INF 423:A01C0S08L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 8, controller 1, enclosure 0, slot 0, S/N CVEM047100Y4064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 26, 2013 8:39:13 AM CDT INF 423:A01C0S09L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 9, controller 1, enclosure 0, slot 1, S/N CVEM9252012C064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 26, 2013 8:39:13 AM CDT INF 423:A01C0S010L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 10, controller 1, enclosure 0, slot 2, S/N CVEM04620027064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
October 26, 2013 8:39:13 AM CDT INF 423:A01C0S11L-- [anon] Drive map: controller 1, channel 0, SCSI device ID 11, controller 1, enclosure 0, slot 3, S/N CVEM04950103064KGN (Vendor: INTEL Model: SSDSA2SH064G1GC)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ORE;IGNORE #攨0
;IGNORE;IGNORE;IGNORE #洼490
;IGNORE;IGNORE;IGNORE #溛0
;IGNORE;IGNORE;IGNORE #漥2
;IGNORE;IGNORE;IGNORE #畖0
;IGNORE;IGNORE;IGNORE #穵0
;IGNORE;IGNORE;IGNORE #窊0
;IGNORE;IGNORE;IGNORE #窪3
;IGNORE;IGNORE;IGNORE #聉0 TZif2^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^M^@^@^@^M^@^@^@^Y^@^@^@B^@^@^@^M^@^@^@^Z 뵣 ^@^U'Sy^V^X<87> ^W^H<86> ^W k^X {^Y ^Z ?|ESC L<9C>^\ =<9C>^]<9C>.<9D>^^<8C>^_<9D>^_|^P<9D> l^A<9D>![ <9D>"K <9E>#;$+Ş%ESC <9E>&^K <9F>'^D ^_' ( 0)xk0) { * ^ + <88>!, y"-<94>j".<84>[#/tL#0d=#1]h
2rC 3=J 4R% 5^], 62^G 6 ^N 8ESC$&8 9 ^F&: Ҧ; &< &= &><85> &?<9A> &? <86>@e ^VA
<83><96>BE<87>^VCc<9C><96>D%i^WEC~<97>F^EK^WG#`<97>G g<97>I^CB<97>I I
<98>J$<98>K +<98>L A^XM<8E>^M<98>Nn^Bh^A^C^B^C^B^C^B^C^B^D^E^D^E^D^E^D^E^D^E^D^E^D^F^G^D^B^C^E^D^E^D^E^D^E^D^E^D^E^D^E^D^E^D^E^D^E^D^E^D
^H ^K^L^@^@^?^U^@^@^@^@p<80>^@^D^@^@<8C>^A ^@^@~<90>^@^D^@^@~<90>^@^D^@^@<8C> ^A ^@^@~<90>^A ^@^@p<80>^@^D^@^@<9A> ^A^O^@^@<8C>
^@^U^@^@<8C> ^@^U^@^@<9A> ^@^U^@^@<8C> ^@^DLMT^@YAKT^@YAKST^@VLAST
^@VLAT^@^DX^@^@^@^@^A^E ^A^@^@^@^B^G<86>^_<82>^@^@^@^C gS^C^@^@^@^D^KH<86><84>^@^@^@^E^M+^K<85>^@^@^@^F^O^L [↩]
iv. This is the biggest thing when you're running Bitcoin anything: you never know that anything happening isn't an attack, because everyone really is out to get you. [↩]
v. Really, my oldest tickets in the system are from 2.5k+ days ago. [↩]
vi. About 9 hours from now. [↩]
vii. Users who still haven't resubmitted their public key as per MPEx - Status Report are still required to do so. [↩]
Wednesday, 30 October 2013
Interesting. I hate hardware RAID for the principal reason that it is difficult if not impossible to truly know what the controller is doing. I know many think that doing RAID in userland is anathema, but I had these kinds of issues on an Adaptec controller until I gave up and went over to RAID-Z (ZFS). I haven't noticed any appreciable difference in performance, and reliability has actually increased. If I could find a RAID controller that was trustworthy and affordable perhaps my opinion would change.
Glad you guys are back online.
Wednesday, 30 October 2013
Well this controller wasn't particularly affordable.
Wednesday, 30 October 2013
Funny, isn't it, how expensive decent RAID hardware is. There is nothing too special about the electronics used, but clearly because it is a product aimed at the "enterprise" segment they can afford to bump the price up a bit. Like SAS drives...