Sunday, 13 August 2017

IEX fee regression

Public exchanges are meant to promote efficient price discovery and risk management. IEX thwarts such efficiencies.

IEX's new fee scheme further damages both price discovery and risk management. Let's meander through this.

The Fall of Icarus, 17th century, Musée Antoine Vivenel
Here is the filing IEX lodged for the fee change:  SR-IEX-2017-27,
"a proposed rule change to increase the fees assessed under specified circumstances for execution of orders that take liquidity during periods when the IEX System has determined that a “crumbling quote” exists" [p3]
That is, IEX is hiking the fees for taking prices, erroneously called taking liquidity, to the maximum fee allowed by the SEC, $0.0030 per share, whenever their crumbling quote indicator (CQI) is on. From your external point of view, the CQI goes off 350 microseconds into the future and stays on until 2.35 milliseconds into the future.

Previously IEX used rebates, a subsidised price of zero in this case, for displayed price taking orders that complied with particular volume constraints. Well kind of: if you hit a displayed price, yes, it is free, but maybe, maybe not, if you hit non-displayed prices. Specifically, taking non-displayed prices costs $0.0009 unless,
"Taking Non-Displayed Liquidity with a Displayable Order and at least 90% of TMVD was identified by IEX as Providing Displayed Liquidity (i.e., the Member’s execution reports reflect that the sum of executions with Fee Code L and a Last Liquidity Indicator (FIX tag 851) of '1' (Added Liquidity), divided by the sum of executions with Fee Code L, is at least 90% for the calendar month)"  [IEX web]
At least that is the old text. IEX has also filed a rule change for this to be specific to a particular MPID: SR-IEX-2017-25,
"Taking Non-Displayed Liquidity with a Displayable Order and at least 90% of TMVD, on a per MPID basis, was identified by IEX as Providing Displayed Liquidity (i.e., the Member’s execution reports reflect that the sum of executions with Fee Code L and a Last Liquidity Indicator (FIX tag 851) of '1' (Added Liquidity), divided by the sum of executions with Fee Code L, is at least 90% for the calendar month)" [p20]
The definition of "TMVD" was also changed to include an MPID reference,
""TMVD" means total monthly volume displayable calculated as the sum of executions from each of the Member's MPID’s (on a per MPID basis) displayable orders during the calendar month." [p19]
The MPID change is to be effective from September 1st, 2017.
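
The 90% test in the quoted text is easy to sketch. This is only an illustration of the wording, not IEX's billing code: the tuple layout, the function names, and the use of share quantities (rather than execution counts) are my assumptions, and "X" is a made-up placeholder for any fee code other than L.

```python
from collections import defaultdict


def displayed_add_ratio_by_mpid(executions):
    """Per-MPID ratio of Fee Code L shares that added liquidity.

    `executions` is an iterable of hypothetical tuples:
    (mpid, fee_code, last_liquidity_ind, shares).
    FIX tag 851 (LastLiquidityInd) is '1' for added liquidity.
    """
    added = defaultdict(int)
    total = defaultdict(int)
    for mpid, fee_code, last_liquidity_ind, shares in executions:
        if fee_code != "L":
            continue  # only Fee Code L executions enter the ratio
        total[mpid] += shares
        if last_liquidity_ind == "1":
            added[mpid] += shares
    return {mpid: added[mpid] / total[mpid] for mpid in total if total[mpid] > 0}


def meets_threshold(ratio, threshold=0.90):
    """True when the ratio reaches the 90% level quoted in the fee schedule."""
    return ratio >= threshold


sample = [
    ("AAAA", "L", "1", 900),
    ("AAAA", "L", "2", 100),
    ("BBBB", "L", "1", 500),
    ("BBBB", "L", "2", 500),
    ("BBBB", "X", "1", 1000),  # ignored: not Fee Code L
]
ratios = displayed_add_ratio_by_mpid(sample)
for mpid, ratio in sorted(ratios.items()):
    print(mpid, f"{ratio:.0%}", meets_threshold(ratio))
```

Grouping by MPID rather than by Member is the whole point of SR-IEX-2017-25: a Member with one MPID above 90% and another below it can get a different bill under the two methods.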

Interestingly in this SR-IEX-2017-25, IEX admits it has been charging members incorrectly as it had been using an MPID based formula all along instead of the published and approved member method. Did IEX report the billing violation to the SEC as a separate event? Should the SEC step in and fine IEX for incorrect billing?
"IEX reviewed Member invoices since its launch as an exchange in August 2016 through June 30, 2017 to assess whether any Members were charged fees that differed from those described in the Fee Schedule. In other words, IEX recalculated the Non-Displayed Match Fee and the 90% threshold exception on a “per Member” basis (which is how the Fee Schedule currently reads) instead of on a “per MPID” basis (which is how IEX in practice had been calculating that fee). This assessment identified that nine Members were charged such differential fees in particular months, in some cases more than the fees described in the Fee Schedule and in some cases less than the fees described in the Fee Schedule. In total, seven Members were charged and paid $18,948.54 in excess fees and eight Members were not charged $44,175.28 in fees that should have been charged. Five Members were overcharged and undercharged in different months." [p14]
To add insult to injury, IEX is going after those people it has been incorrectly undercharging for the last twelve months. Bumper bills in September,
"IEX will charge..each impacted member for the net amount..underpaid and will be included in the August 2017 monthly invoices to be sent in September 2017" [p14]
I'm not sure how I'd define great customer service, but this would not be it.

Let's meander back to the main issue of charging the maximum fee possible for CQI conditions. It is not quite as simple as just charging the maximum fee under those conditions. IEX applies some threshold relief. The wording is a little poorly written for a formal document, but the idea is that the big fee applies only if you do at least one million shares a month, and then only to the taken prices beyond the number represented by 5% of your total executions, on an MPID basis.
"At the end of each calendar month, executions with Fee Code Q that exceed the CQRF Threshold are subject to the Crumbling Quote Remove Fee. Otherwise, to the extent a Member receives multiple Fee Codes on an execution, the lower fee shall apply."
" “CQRF Threshold” means the Crumbling Quote Remove Fee Threshold. The threshold is equal to 5% of the sum of a Member’s total monthly executions on IEX if at least 1,000,000 shares during the calendar month, measured on an MPID basis."
"Executions with Fee Code Q that exceed the CQRF Threshold are subject to the Crumbling Quote Remove Fee."
Apart from trying to make NYSE American's task harder, IEX's goal is to prevent adverse selection against price providers.
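
Here is a minimal sketch of the threshold relief as I read the quoted text. The reading itself is an assumption: below one million shares a month an MPID pays no Crumbling Quote Remove Fee, and above it only the Fee Code Q shares beyond 5% of total monthly executions are charged. The function names are hypothetical and the fees on all other shares are not modelled.

```python
from decimal import Decimal

CQRF_FEE_PER_SHARE = Decimal("0.0030")  # maximum take fee quoted in the filing
MIN_MONTHLY_SHARES = 1_000_000          # activity floor in the CQRF Threshold definition
THRESHOLD_DIVISOR = 20                  # 5% of total executions == total // 20


def shares_subject_to_cqrf(total_shares, fee_code_q_shares):
    """Fee Code Q shares above the 5% threshold, for one MPID in one month."""
    if total_shares < MIN_MONTHLY_SHARES:
        return 0  # assumed: no CQRF below the one million share floor
    threshold = total_shares // THRESHOLD_DIVISOR
    return max(0, fee_code_q_shares - threshold)


def cqrf_charge(total_shares, fee_code_q_shares):
    """Dollar charge at the CQRF rate on the excess Fee Code Q shares only."""
    return CQRF_FEE_PER_SHARE * shares_subject_to_cqrf(total_shares, fee_code_q_shares)


# Example: 2,000,000 shares executed in the month, 150,000 of them with Fee Code Q.
print(shares_subject_to_cqrf(2_000_000, 150_000))
print(cqrf_charge(2_000_000, 150_000))
```

Note that nothing here can be evaluated at order entry time. The threshold is a month-end figure, so the fee on any single execution is only known in arrears.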

IEX reports,
"Across all approximately 8,000 symbols available for trading on IEX, the CQI is on only 1.24 seconds per symbol per day on average (0.005% of the time during regular market hours), but 30.4% of marketable orders are received during those time periods, which indicates that certain types of trading strategies are seeking to aggressively target liquidity providers during periods of quote instability. " [p26]
That is, IEX is looking to dramatically increase fees on 30.4% of marketable orders. If you read the statement in bold above you might find yourself nodding. IEX overstates this. Remember the CQI applies for 2,000,000 nanoseconds after it is triggered. When the CQI is a true positive, this means that if you want to trade on IEX when the price changes, then you pay a premium.
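
The figures in IEX's quote can be sanity checked with a few lines of arithmetic. The only assumption is that "regular market hours" means the 6.5 hour session from 09:30 to 16:00.

```python
REGULAR_SESSION_SECONDS = 6.5 * 60 * 60  # assumed 09:30 to 16:00 session
CQI_ON_SECONDS = 1.24                    # per symbol per day, from the filing
ORDER_SHARE = 0.304                      # share of marketable orders, from the filing

# Fraction of the session during which the CQI is on.
time_share = CQI_ON_SECONDS / REGULAR_SESSION_SECONDS

# How much denser order arrival is while the CQI is on, relative to the
# all-day average: (share of orders) / (share of time).
concentration = ORDER_SHARE / time_share

print(f"CQI on for {time_share:.4%} of the session")
print(f"Marketable orders arrive roughly {concentration:,.0f} times more densely while it is on")
```

The first number reproduces the 0.005% in the filing. The second is simply what you would expect if people want to trade when prices move.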

That is, IEX applies the highest price it legally can to discourage trading around the time the price changes. That is a harsh penalty that impacts the efficiency of both price discovery and risk management. I guess it is just the important times; those times when prices change. Why would a trader want to trade at important times? Such an attitude goes against the explicit goals the SEC has memorialized many times with regard to the purpose of the National Market System. Then again, if you're a Franken-pool that prefers dark trading, why not further entrench a microstructure that is destructive to the public market interest?

Another important feature of such a beast is that you can't always decide in advance whether your order will be subject to the CQI, as IEX has the benefit of last-look, or looking into the future, within the exchange. You may only know after the event that a CQI applies, but not as you place the trade. Trading with an unknown fee may be less than optimal for some institutions. Best execution obligations are certainly harder to meet. Perhaps it is best not to trade at IEX if you may be inadvertently violating best-ex.

Another amusing aside to the silliness of it all comes from the poor implementation of the CQI. When IEX changed to their new IEX Signal implementation of the CQI, they reported,
"On our example day of December 15, 2016, ... This new candidate formula would have produced about 2 million true positives and 2.1 million false positives." [The Evolution of the Crumbling Quote Signal, Allison Bishop, p28]
IEX has a pretty dumb one-size-fits-all CQI implementation that has more false positives than true positives - according to IEX. False positive domination means the CQI is normally fake news. That is, the majority of the time IEX charges you the SEC's maximum legally possible fee, their rationale is invalid. You couldn't make this up if it wasn't true.
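
For the record, here is the precision implied by IEX's own rounded numbers for that example day. Nothing is assumed beyond the two figures in the quote.

```python
true_positives = 2_000_000   # "about 2 million", from the IEX paper
false_positives = 2_100_000  # "2.1 million", from the IEX paper

# Precision: of all the times the signal fired, how often was it right?
precision = true_positives / (true_positives + false_positives)

print(f"CQI precision on the example day: {precision:.1%}")
```

A signal that is right less than half the time it fires is a thin justification for the maximum fee.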

There are also two humorous outcomes relating to IEX's routing implementation. IEX's routed orders may be subject to an excessive CQI fee for a marketable order, particularly large institutional orders. Even funnier, it may perhaps be a best execution requirement that if there are shares available for an order elsewhere, the IEX router should not route to IEX as this will save clients' money due to avoiding the high IEX price taking fees. If IEX does not comply, then you'd hope the SEC will take action against IEX's violation of best-ex obligations. It would be funny to see IEX fined for routing orders to itself. That'd definitely be worth a chuckle.

IEX leaks information by design. It is more subject to latency arbitrage due to its SIP leakage and lack of fair co-lo. It not only prevents trading at price change time with its dark fading orders but now wants to discourage price discovery and risk management with high fees for when trading is needed most - at times of change. What happened to simple and few order types with simple and transparent pricing?

The IEX cult is becoming a lot more like the Flat Earth Society. I wonder if the mainstream media will ever call out IEX's misleading hypocrisy for the hubristic bullshit it truly is?

Happy trading,



PS: IEX, instead of being greedy by taking the fee for yourself, you could generously provide some of it to the price provider. Would that be a compensatory rebate or a kickback?

Friday, 30 June 2017

FPGAs and AI processors: DNN and CNN for all

Here is a nice hidden node from a traditional 1990's style gender identification neural net I did a few weeks ago.

A 90's style hidden node image in a simple gender identifier net
Source: my laptop
My daughter was doing a course as part of her B Comp Eng, the degree after her acting degree. Not being in the same city I thought maybe I could look at her assignment and help in parallel. Unsurprisingly, she didn't need nor want my help. No man-splaining necessary from the old timer father. Nevertheless, it was fun to play with the data Bec pulled down on faces. Bec's own gender id network worked fine for 13 out of 14 photos of herself fed into the trained net. Nice.

I was late to the party and first spent time with neural nets in the early nineties. As a prop trader at Bankers Trust in Sydney, I used a variety of software including a slightly expensive graphical tool from NeuroDimension that also generated C++ code for embedding. It had one of those parallel port copy protection dongles that were a pain. I was doing my post-grad at a group at uni that kept changing its name from something around connectionism, to adaptive methods, and then data fusion. I preferred open source and the use of NeuroDimension waned. I ported the Stuttgart Neural Network Simulator, SNNS, to the new MS operating system, Windows NT (with OS/3 early alpha branding ;-) ), and briefly became the support guy for that port. SNNS was hokey code with messy static globals but it worked pretty fly for a white guy.

My Master of Science research project was a kind of cascade correlation-like neural net, Multi-rate Optimising Order Statistic Equaliser (MOOSE), for intraday Bund trading. The MOOSE was a bit of work designed for acquiring fast LEO satellite signals (McCaw's Teledesic), repurposed for playing with Bunds as they migrated from LIFFE to DTB. As a prop trader at an investment bank, I could buy neat toys. I had the world's fastest computer at the time: an IBM MicroChannel machine with dual Pentium Pro 200MHz processors plus SCSI and some megabytes of RAM. Pulling 800,000 points into my little C++ stream/dag processor seemed like black magic in 1994. Finite differencing methods let me do oodles of O(1) incremental linear regressions and the like to get 1000 fold speed-ups. It seemed good at the time. Today, your phone would laugh in my general direction.
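
For the curious, the O(1) incremental regression idea looks something like the sketch below. This is not the original C++ from 1994, just a minimal Python illustration with a hypothetical class name: keep running sums over a sliding window, add the newest point, subtract the oldest, and the slope and intercept fall out without revisiting the window.

```python
from collections import deque


class RollingLinearRegression:
    """Least-squares line over the last `window` points, updated in O(1)."""

    def __init__(self, window):
        if window < 2:
            raise ValueError("window must hold at least two points")
        self.window = window
        self.points = deque()
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def _accumulate(self, x, y, sign):
        # sign is +1.0 when a point enters the window, -1.0 when it leaves.
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y

    def update(self, x, y):
        """Add the newest point and drop the oldest once the window is full."""
        self.points.append((x, y))
        self._accumulate(x, y, +1.0)
        if len(self.points) > self.window:
            old_x, old_y = self.points.popleft()
            self._accumulate(old_x, old_y, -1.0)

    def fit(self):
        """Return (slope, intercept), or None if the line is undetermined."""
        n = len(self.points)
        denom = n * self.sxx - self.sx * self.sx
        if n < 2 or denom == 0:
            return None
        slope = (n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / n
        return slope, intercept


reg = RollingLinearRegression(window=4)
for x in range(10):
    reg.update(float(x), 2.0 * x + 1.0)  # points on the line y = 2x + 1
print(reg.fit())
```

The cost per tick is a handful of adds and multiplies regardless of window length, which is where the large speed-ups come from. The catch is that running sums drift numerically on long streams of real-valued data, so a production version would re-sum periodically or use a more stable update.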

There was plenty of action in neural nets back in those days. Not much of it was overly productive but it was useful. I was slightly bemused to read Eric Schmidt's take on machine learning and trading in Lindsay Fortado and Robin Wigglesworth's FT article "Machine learning set to shake up equity hedge funds",
Eric Schmidt, executive chairman of Alphabet, Google’s parent company, told a crowd of hedge fund managers last week that he believes that in 50 years, no trading will be done without computers dissecting data and market signals.
“I’m looking forward to the start-ups that are formed to do machine learning against trading, to see if the kind of pattern recognition I’m describing can do better than the traditional linear regression algorithms of the quants,” he added. “Many people in my industry think that’s amenable to a new form of trading.”
Eric, old mate, you know I was late to the party in the early nineties, what does that make you?

Well, things are different now. I like to think of the new neural renaissance, and have written about it, as The Age of Perception. It is not intelligence, it is just good at patterns. It is still a bit hopeless at language ambiguities. It will also be a while before it understands the underlying values and concepts for deep financial understanding. 

Deep learning is simultaneously both overhyped and underestimated. It is not intelligence, but it will help us get there. It is overhyped by some as an AI breakthrough that will give us cybernetic human-like replicants. We still struggle with common knowledge and ambiguity in simple text for reasoning. We have a long way to go. The impact of relatively simple planning algorithms and heuristics along with the dramatic deep learning based perception abilities from vision, sound, text, radar, et cetera, will be as profound as every person and their dog now understands. That's why I call it, The Age of Perception. It is as if the supercomputers in our pockets have suddenly awoken with their eyes quickly adjusting to the bright blinking blight that is the real world. 

The impact will be dramatic and lifestyle changing for the entire planet. Underestimate the impact at your peril. No, we don't have a date with a deep Turing conversationalist that will provoke and challenge our deepest thoughts - yet. That will inevitably come, but it is not on the visible horizon. Smart proxies aided by speech, text and Watson-like Jeopardy databases will give a very advanced Eliza, but no more. Autonomous transport, food production, construction, yard and home help will drive dramatic lifestyle and real-estate value changes.

Apart from this rambling meander, my intention here was to collect some thoughts on the chips driving the current neural revolution. Not the most exciting thought for many, but it is a useful exercise for me.

Neural network hardware

Neural processing is not a lot different today compared to twenty years ago. Deep is more of a brand than a difference. The activation functions have been simplified which suits hardware better. Mainly there is more data and a better understanding of how to initialise the weights, handle many layers, parallelise, and improve robustness via techniques such as dropout. The Neocognitron architecture from 1980 is not much different to today's deep learner or CNN, but it helped that Yann LeCun allowed it to learn. 

Back in the nineties there were also plenty of neural hardware platforms such as CNAPS (1990) with its 64 processing units and 256kB of memory for doing 1.6 GCPS (connections per second, CPS) for 8/16-bit or 12.8 GCPS for 1-bit. You can read about Synapse-1, CNAPS, SNAP, CNS Connectionist Supercomputer, Hitachi WSI, My-Neupower, LNeuro 1.0, UTAK1, GNU Implementation (no, not GNU GNU, General Neural Unit), UCL, Mantra 1, Biologically-Inspired Emulator, INPG Architecture, BACHUS, and ZISC036 in "Overview of neural hardware", [Heemskerk, 1995, draft].

Phew, it seems a lot but that excluded the software and accelerator board/CPU combos, such as ANZA plus, SAIC SIGMA-1, NT6000, Balboa 860 coprocessor, Ni1000 Recognition Accelerator Hardware (Intel), IBM NEP, NBC, Neuro Turbo I, Neuro Turbo II, WISARD, Mark II & IV, Sandy/8, GCN (Sony), Topsi, BSP400 (400 microprocessors), DREAM Machine, RAP, COKOS, REMAP, General Purpose Parallel Neurocomputer, TI NETSIM, and GeNet. Then there were quite a few analogue and hybrid analogue implementations, including Intel's Electrically Trainable Analog Neural Network (801770NX). You get the idea, there was indeed a lot back in the day.

All a go go in 1994:

Optimistically Moore's Law was telling us a TeraCPS was just around the corner,
"In the next decade micro-electronics will most likely continue to dominate the field of neural network implementation. If progress advances as rapidly as it has in the past, this implies that neurocomputer performances will increase by about two orders of magnitude. Consequently, neurocomputers will be approaching TeraCPS (10^12 CPS) performance. Networks consisting of 1 million nodes, each with about 1,000 inputs, can be computed at brain speed (100-1000 Hz). This would offer good opportunities to experiment with reasonably large networks."
The first neural winter was the cruel subversion of research dollars by Minsky and Papert's dissing of Rosenblatt's perceptron dream with incorrect hand-wavy generalisations about hidden layers that ultimately led to Rosenblatt's untimely death. In 1995 another neural winter was kind of underway although I didn't really know it at the time. As a frog in the saucepan, I didn't notice the boil. This second winter was fired up by a lack of exciting progress and general boredom. 

The second neural winter ended with the dramatic improvements in ImageNet processing from the University of Toronto's SuperVision entry, AlexNet, in 2012, thanks to Geoffrey Hinton's winter survival skills. This result was then blown apart by Google's GoogLeNet Inception model in 2014. So, the Age of Perception started in 2012 by my reckoning. Mark your diaries. We're now five years in.

Google did impressive parallel CPU work with lossy updates across a few thousand regular machines. Professor Andrew Ng and friends made the scale approachable by enabling dozens of GPUs to do the work of thousands of CPUs. Thus, we were saved from the prospect of neural processing being only for the well funded. Well, kind of, now the state of the art sometimes needs thousands of GPUs or specific chips. 

More data and more processing have been quite key. Let's get to the point and list some of the platforms that are key to the Age of Perception's big data battle:

GPUs from Nvidia

These are hard to beat. The subsidisation that comes from the large video processing market drives tremendous economies of scale. The new Nvidia V100 can do 15 TFlops of SP or 120 TFlops with its new Tensor core architecture which is a FP16 multiply and FP32 accumulate or add to suit ML. Nvidia is packing up 8 boards into their DGX-1 for 960 Tensor TFlops. 
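
Why bother with FP16 multiply and FP32 accumulate? The toy below gives the flavour. It is not how a Tensor core is programmed, just a standard library Python sketch that rounds a running dot product to half or single precision after every add. The helper names are mine.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the IEEE 754 format named by a struct code."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def fp16(value):
    return round_to("e", value)  # 'e' is IEEE 754 binary16


def fp32(value):
    return round_to("f", value)  # 'f' is IEEE 754 binary32


def dot(a, b, accumulate):
    """Dot product with FP16 inputs; `accumulate` rounds the running total."""
    total = 0.0
    for x, y in zip(a, b):
        product = fp16(x) * fp16(y)  # product of two FP16 values is exact in a double
        total = accumulate(total + product)
    return total


n = 4096
ones = [1.0] * n
fp16_total = dot(ones, ones, fp16)  # accumulate in half precision
fp32_total = dot(ones, ones, fp32)  # accumulate in single precision

print("FP16 accumulate:", fp16_total)
print("FP32 accumulate:", fp32_total)

# The DGX-1 figure quoted above is just eight boards at 120 Tensor TFlops each.
dgx1_tensor_tflops = 8 * 120
print("DGX-1 Tensor TFlops:", dgx1_tensor_tflops)
```

Half precision runs out of integer resolution well before 4096, so the FP16 running total stops tracking the true sum while the FP32 one stays exact. Long dot products are the bread and butter of neural nets, hence the wider accumulator.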

GPUs from AMD

AMD has been playing catch-up with Nvidia in the ML space. The soon to be released AMD Radeon Instinct MI25 is promising 12.3 TFlops of SP or 24.6 TFlops of FP16. If your calculations are amenable to Nvidia's Tensors, then AMD can't compete. Nvidia also does twice the bandwidth with 900GB/s versus AMD's 484 GB/s. 

Google's TPUs

Google's original TPU had a big lead over GPUs and helped power DeepMind's AlphaGo victory over Lee Sedol in a Go tournament. The original 700MHz TPU is described as having 95 TFlops for 8-bit calculations or 23 TFlops for 16-bit whilst drawing only 40W. This was much faster than GPUs on release but is now slower than Nvidia's V100, but not on a per W basis. The new TPU2 is referred to as a TPU device with four chips and can do around 180 TFlops. Each chip's performance has been doubled to 45 TFlops for 16-bits. You can see the gap to Nvidia's V100 is closing. You can't buy a TPU or TPU2. Google is making them available for use in their cloud with TPU pods containing 64 devices for up to 11.5 PetaFlops of performance. The giant heatsinks on the TPU2 are some cause for speculation, but the market is changing from devices to units with groups of devices and also such groups within the cloud.
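
The TPU numbers above hang together, as a quick check shows. Everything here comes from the figures quoted in the paragraph; the variable names are mine.

```python
TPU2_CHIP_TFLOPS = 45   # per chip, 16-bit
CHIPS_PER_DEVICE = 4
DEVICES_PER_POD = 64

device_tflops = TPU2_CHIP_TFLOPS * CHIPS_PER_DEVICE
pod_pflops = device_tflops * DEVICES_PER_POD / 1000

# Original TPU efficiency at the quoted 40W.
TPU1_WATTS = 40
tpu1_8bit_tflops_per_watt = 95 / TPU1_WATTS
tpu1_16bit_tflops_per_watt = 23 / TPU1_WATTS

print(f"TPU2 device: {device_tflops} TFlops")
print(f"TPU2 pod: {pod_pflops:.2f} PetaFlops")
print(f"Original TPU: {tpu1_8bit_tflops_per_watt:.3f} TFlops/W at 8-bit, "
      f"{tpu1_16bit_tflops_per_watt:.3f} TFlops/W at 16-bit")
```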

Wave Computing

Wave's Aussie CTO, Dr Chris Nicol, has produced a wonderful piece of work with Wave's asynchronous data flow processor in their Compute Appliance. I was introduced to Chris briefly a few years ago in California by Metamako Founder Charles Thomas. They both used to work on clockless async stuff at NICTA. Impressive people those two. 

I'm not sure Wave's appliance was initially targeting ML but their ability to run TensorFlow at 2.9 PetaOPS/sec on their 3RU appliance is pretty special. Wave refers to their processors as DPUs and an appliance has 16 DPUs. Wave uses processing elements it calls Coarse Grained Reconfigurable Arrays (CGRAs). It is unclear what bit width the 2.9 PetaOPS/s is referring to. From their white paper, the ALUs can do 1b, 8b, 16b and 32b,
"The arithmetic units are partitioned. They can perform 8-b operations in parallel (ideal for DNN inferencing) as well as 16-b and 32-b operations (or any combination of the above). Some 64-b operations are also available and these can be extended to arbitrary precision using software."
Here is a bit more on one of the 16 DPUs included in the appliance,
"The Wave Computing DPU is an SoC that contains a 16,384 PEs, configured as a CGRA of 32x32 clusters. It includes four Hybrid Memory Cube (HMC) Gen 2 interfaces, two DDR4 interfaces, a PCIe Gen3 16-lane interface and an embedded 32-b RISC microcontroller for SoC resource management. The Wave DPU is designed to execute autonomously without a host CPU."
On TensorFlow ops, 
"The Wave DNN Library team creates pre-compiled, relocatable kernels for common DNN functions used by workflows like TensorFlow. These can be assembled into Agents and instantiated into the machine to form a large data flow graph of tensors and DNN kernels."
"...a session manager that interfaces with machine learning workflows like TensorFlow, CNTK, Caffe and MXNet as a worker process for both training and inferencing. These workflows provide data flow graphs of tensors to worker processes. At runtime, the Wave session manager analyzes data flow graphs and places the software agents into DPU chips and connects them together to form the data flow graphs. The software agents are assigned regions of global memory for input buffers and local storage. The static nature of the CGRA kernels and distributed memory architecture enables a performance model to accurately estimate agent latency. The session manager uses the performance model to insert FIFO buffers between the agents to facilitate the overlap of communication and computation in the DPUs. The variable agents support software pipelining of data flowing through the graph to further increase the concurrency and performance. The session manager monitors the performance of the data flow graph at runtime (by monitoring stalls, buffer underflow and/or overflow) and dynamically tunes the sizes of the FIFO buffers to maximize throughput. A distributed runtime management system in DPU-attached processors mounts and unmounts sections of the data flow graph at run time to balance computation and memory usage. This type of runtime reconfiguration of a data flow graph in a data flow computer is the first of its kind."
Yeah, me too. Very cool.

The exciting thing about this platform is that it is coarser than FPGA in architectural terms and thus less flexible, but likely to perform better. Very interesting.

KnuEdge's KnuPath

I tweeted about KnuPath back in June 2016. Their product page has since gone missing in action. I'm not sure what they are up to with the $100M they put into their MIMD architecture. It was described at the time as having 256 tiny DSP, or tDSP, cores on each ASIC along with an ARM controller suitable for sparse matrix processing in a 35W envelope. 

(source: HPC Wire)
The performance is unknown, but they compared their chip to a current NVIDIA, at that time, and said they had 2.5 times the performance. We know Nvidia is now more than ten times faster with their Tensor cores so KnuEdge will have a tough job keeping up. A MIMD or DSP approach will have to execute awfully well to take some share in this space. Time will tell. 

Intel's Nervana

Intel purchased Nervana Systems, which was developing a GPU/software approach in addition to its Nervana Engine ASIC. Comparable performance is unclear. Intel is also planning on integrating the technology into the Phi platform via a Knights Crest project. NextPlatform suggested the 2017 target on 28nm may be 55 TOPS/s for some width of OP. There is a NervanaCon Intel has scheduled for December, so perhaps we'll see the first fruits then.

Horizon Robotics

This Chinese start-up has a Brain Processing Unit (BPU) in the works. Dr Kai Yu has the right kind of pedigree as he was previously the head of Baidu's Institute of Deep Learning. Earlier this year a BPU emulation on an Arria 10 FPGA was shown in this Youtube clip. There is little information on this platform in public.


Eyeriss

Eyeriss is an MIT project that developed a 65nm ASIC with unimpressive raw performance. The chip is about half the speed of an Nvidia TK1 on AlexNet. The neat aspect was that such middling performance was achieved by a 278mW reconfigurable accelerator thanks to its row stationary approach. Nice.


Graphcore

Graphcore raised $30M of Series-A late last year to support the development of their Intelligence Processing Unit, or IPU. Their website is a bit sparse on details with hand-wavy facts such as >14,000 independent processor threads and >100x memory bandwidth. Some snippets have snuck out with NextPlatform reporting over a thousand true cores on the chip with a custom interconnect. Its PCIe board has a 16-processor element. It sounds kind of dataflowy. Unconvincing PR aside, the team has a strong rep and the investors are not naive, so we'll wait and see.


Tenstorrent

Tenstorrent is a small Canadian start-up in Toronto claiming an order of magnitude improvement in efficiency for deep learning, like most. No real public details but they are on the Cognitive 300 list.


Cerebras

Cerebras is notable due to its backing from Benchmark and that its founder was the CEO of SeaMicro. It appears to have raised $25M and remains in stealth mode.


Thinci

Thinci is developing vision processors from Sacramento with employees in India too. They claim to be at the point of first silicon, Thinci-tc500, along with benchmarking and winning of customers already happening. Apart from "doing everything in parallel" we have little to go on.


Koniku

Koniku's web site is counting down and has 72 days showing until my new reality. I can hardly wait. They have raised very little money and after watching their YouTube clip embedded in this Forbes page, you too will likely not be convinced, but you never know. Harnessing biological cells is certainly different. It sounds like a science project, but, then this,
"We are a business. We are not a science project," Agabi, who is scheduled to speak at the Pioneers Festival in Vienna, next week, says, "There are demands that silicon cannot offer today, that we can offer with our systems."
The core of the Koniku offer is the so-called neuron-shell, inside which the startup says it can control how neurons communicate with each other, combined with a patent-pending electrode which allows to read and write information inside the neurons. All this packed in a device as large as an iPad, which they hope to reduce to the size of a nickel by 2018.


Adapteva

Adapteva is a favourite little tech company of mine to watch as you'll see in this previous meander, "Adapteva tapes out Epiphany-V: A 1024-core 64-bit RISC processor." Andreas Olofsson taped out his 1024 core chip late last year and we await news of its performance. Epiphany-V has new instructions for deep learning and we'll have to see if this memory-controller-less design with 64MB of on-chip memory will have appropriate scalability. The impressive efficiency of Andreas's design and build may make this a chip we can all actually afford, so let's hope it performs well.


Knowm

Knowm talks about Anti-Hebbian and Hebbian (AHaH) plasticity and memristors. Here is a paper covering the subject, "AHaH Computing–From Metastable Switches to Attractors to Machine Learning." It's a bit too advanced for me. With a quick glance I can't tell the difference between this tech and hocus-pocus but it looks sciency. I'm gonna have to see this one in the flesh to grok it. The idea of neuromemristive processors is intriguing. I do like a good buzzword in the morning.


Mythic

A battery powered neural chip from Mythic with 50x lower power. Not so many real details out there. The chip is the size of a button, but aren't most chips?
"Mythic's platform delivers the power of desktop GPU in a button-sized chip"
Perhaps another one that is suitable for drones and phones that is likely to be eaten or sidelined by a phone.


Qualcomm

Phones are an obvious place for ML hardware to crop up. We want to identify the dog type, flower, leaf, cancerous mole, translate a sign, understand the spoken word, etc. Our pocket supercomputers would like all the help they can get for the Age of Perception.

Qualcomm has been fussing around ML for a while with the Zeroth SDK and Snapdragon Neural Processing Engine. The NPE certainly works reasonably well on the Hexagon DSP that Qualcomm use. The Hexagon DSP is far from a very wide parallel platform and it has been confirmed by Yann LeCun that Qualcomm and Facebook are working together on a better way in Wired's "The Race To Build An AI Chip For Everything Just Got Real",
"And more recently, Qualcomm has started building chips specifically for executing neural networks, according to LeCun, who is familiar with Qualcomm's plans because Facebook is helping the chip maker develop technologies related to machine learning. Qualcomm vice president of technology Jeff Gehlhaar confirms the project. "We're very far along in our prototyping and development," he says."
Perhaps we'll see something soon beyond the Kryo CPU, Adreno GPU, Hexagon DSP, and Hexagon Vector Extensions. It is going to be hard to be a start-up in this space if you're competing against Qualcomm's machine learning.

Pezy-SC and Pezy-SC2

These are the 1024 core and 2048 core processors that Pezy develop. The Pezy-SC 1024 core chip powered the top 3 systems on the Green500 list of supercomputers back in 2015. The Pezy-SC2 is the follow up chip that is meant to be delivered by now, and I do see a talk in June about it, but details are scarce yet intriguing,
"PEZY-SC2 HPC Brick: 32 of PEZY-SC2 module card with 64GB DDR4 DIMM (2.1 PetaFLOPS (DP) in single tank with 6.4Tb/s"
It will be interesting to see what 2,048 MIMD MIPS Warrior 64-bit cores can do. In the June 2017 Green500 list, an Nvidia P100 system took the number one spot and there is a Pezy-SC2 system at number 7. So the chip seems alive but details are thin on the ground. Motoaki Saito is certainly worth watching.


Kalray

Despite many promises, Kalray has not progressed their chip offering beyond the 256 core beast I covered back in 2015, "Kalray - new product meander." Kalray is advertising their product as suitable for embedded self-driving car applications though I can't see the product architecture being an ideal CNN platform in its current form. Kalray has a Kalray Neural Network (KaNN) software package and claims better efficiency than GPUs with up to 1 TFlop/s on chip.

Kalray's NN fortunes may improve with an imminent product refresh and just this month Kalray completed a new funding round that raised $26M. The new Coolidge processor is due in mid-2018 with 80 or 160 cores along with 80 or 160 co-processors optimised for vision and deep learning.

This is quite a change in architecture from their >1000 core approach and I think it is most sensible.

IBM TrueNorth

TrueNorth is IBM's Neuromorphic CMOS ASIC developed in conjunction with the DARPA SyNAPSE program.
It is a manycore processor network on a chip design, with 4096 cores, each one simulating 256 programmable silicon "neurons" for a total of just over a million neurons. In turn, each neuron has 256 programmable "synapses" that convey the signals between them. Hence, the total number of programmable synapses is just over 268 million (2^28). In terms of basic building blocks, its transistor count is 5.4 billion. Since memory, computation, and communication are handled in each of the 4096 neurosynaptic cores, TrueNorth circumvents the von-Neumann-architecture bottlenecks and is very energy-efficient, consuming 70 milliwatts, about 1/10,000th the power density of conventional microprocessors. [Wikipedia]
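The arithmetic in that description is easy to check. A minimal sketch, using only the core, neuron and synapse counts quoted above (the variable names are mine):

```python
# Figures as quoted in the TrueNorth description above.
CORES = 4096
NEURONS_PER_CORE = 256
SYNAPSES_PER_NEURON = 256

# Total neurons: one group of 256 per core.
neurons = CORES * NEURONS_PER_CORE

# Total synapses: 256 per neuron.
synapses = neurons * SYNAPSES_PER_NEURON

print(f"neurons:  {neurons:,}")
print(f"synapses: {synapses:,}")
print(f"synapses is 2**28: {synapses == 2**28}")
```

So "just over a million" neurons is 2^20 and "just over 268 million" synapses is 2^28.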
TrueNorth was previously criticised for running spiking neural networks rather than being fit for deep learning. IBM has since developed a new algorithm for running CNNs on TrueNorth,
Instead of firing every cycle, the neurons in spiking neural networks must gradually build up their potential before they fire...Deep-learning experts have generally viewed spiking neural networks as inefficient—at least, compared with convolutional neural networks—for the purposes of deep learning. Yann LeCun, director of AI research at Facebook and a pioneer in deep learning, previously critiqued IBM’s TrueNorth chip because it primarily supports spiking neural networks... 
...the neuromorphic chips don't inspire as much excitement because the spiking neural networks they focus on are not so popular in deep learning.
To make the TrueNorth chip a good fit for deep learning, IBM had to develop a new algorithm that could enable convolutional neural networks to run well on its neuromorphic computing hardware. This combined approach achieved what IBM describes as “near state-of-the-art” classification accuracy on eight data sets involving vision and speech challenges. They saw between 65 percent and 97 percent accuracy in the best circumstances.
When just one TrueNorth chip was being used, it surpassed state-of-the-art accuracy on just one out of eight data sets. But IBM researchers were able to boost the hardware’s accuracy on the deep-learning challenges by using up to eight chips. That enabled TrueNorth to match or surpass state-of-the-art accuracy on three of the data sets.
The TrueNorth testing also managed to process between 1,200 and 2,600 video frames per second. That means a single TrueNorth chip could detect patterns in real time from between as many as 100 cameras at once..." [IEEE Spectrum]
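The camera claim in that quote can be sanity checked with a rough sketch. The frame throughput range is from the quote; the per-camera frame rates of 24 and 30 frames per second are my assumption, not something the article states:

```python
# Quoted throughput range for TrueNorth, in video frames per second.
CHIP_FPS_LOW, CHIP_FPS_HIGH = 1200, 2600

# ASSUMPTION: typical camera frame rates, not taken from the article.
ASSUMED_CAMERA_FPS = (24, 30)


def cameras_supported(chip_fps: int, camera_fps: int) -> int:
    """Whole number of real-time camera streams a given frame budget covers."""
    return chip_fps // camera_fps


for camera_fps in ASSUMED_CAMERA_FPS:
    low = cameras_supported(CHIP_FPS_LOW, camera_fps)
    high = cameras_supported(CHIP_FPS_HIGH, camera_fps)
    print(f"{camera_fps} fps cameras: between {low} and {high} streams")
```

Under those assumed frame rates the top of the range lands near the "100 cameras" figure in the quote.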
Power efficiency is quite brilliant on TrueNorth and makes it very worthy of consideration.

Brainchip's Spiking Neuron Adaptive Processor (SNAP)

SNAP will not do deep learning and is a curiosity rather than a practical drop-in CNN engineering solution, for now. IBM's stochastic phase-change neurons seem more interesting if that is a path you wish to tread.

Apple's Neural Engine

Will it or won't it? Bloomberg is reporting that it will, as a secondary processor, but there is little detail. Not only is it an important area for Apple, but it also helps Apple avoid, and compete with, Qualcomm.


Cambricon - Chinese Academy of Sciences invests $1.4M for the chip. It is an instruction set architecture for NNs with data-level parallelism, customised vector/matrix instructions, and on-chip scratchpad memory. It claims 91 times the performance of an x86 CPU and 3 times that of a K40M, with 1% or 1.695W of peak power use. "Cambricon-X: An Accelerator for Sparse Neural Networks" and "Cambricon: An Instruction Set Architecture for Neural Networks."

Ex-Googlers and Groq Inc. Perhaps another TPU?


Deep Vision is building low-power chips for deep learning. Perhaps one of these papers by the founders has clues, "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" [2013] and "Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing" [2015].

Deep Scale.

Reduced Energy Microsystems are developing lower power asynchronous chips to suit CNN inference. REM was Y Combinator's first ASIC venture according to TechCrunch.

Leapmind is busy too.


Microsoft has thrown its hat into the FPGA ring, "Microsoft Goes All in for FPGAs to Build Out AI Cloud." Wired did a nice story on the MSFT use of FPGAs too, "Microsoft Bets Its Future on a Reprogrammable Computer Chip"
"On Bing, which an estimated 20 percent of the worldwide search market on desktop machines and about 6 percent on mobile phones, the chips are facilitating the move to the new breed of AI: deep neural nets."
I have some affinity for this approach. Xilinx's and Intel's (née Altera) FPGAs are powerful engines. Xilinx naturally claim their FPGAs are best for INT8, with one of their white papers containing the following slide,

Both vendors have good support for machine learning with their FPGAs:

Whilst performance per Watt is impressive for FPGAs, the vendors' larger chips have long had earth-shatteringly high prices. Xilinx's VU9P lists at over $US 50k at Avnet.

Finding a balance between price and capability is the main challenge with the FPGAs.

One thing to love about the FPGA approach is the ability to make some quite wonderful architectural decisions. Say you want to improve your memory streaming of floating point via compressing off-board DRAM for HBM and uncompressing it in real time: there is a solution if you try hard enough, "Bandwidth Compression of Floating-Point Numerical Data Streams for FPGA-Based High-Performance Computing"

This kind of dynamic architectural agility would be a hard thing to pull off with almost any other technology.

Too many architectural choices may be considered a problem, but I kind of like that problem myself. Here is a nice paper on closing the performance gap between custom hardware and FPGA processors with an FPGA-based horizontally microcoded compute engine that reminds me of the old DISC or discrete instruction set computer from many moons ago, "Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT"


Trying to forecast a winner in this kind of race is a fool's errand. Qualcomm will be well placed simply due to their phone dominance. Apple will no doubt succeed with whatever they do. Nvidia's V100 is quite a winner with its Tensor units. I'm not sure I can see Google's TPU surviving in a relentless long-term silicon march despite its current impressive performance. I'm fond of the FPGA approach but I can't help but think they should release DNN editions at much cheaper price points so that they don't get passed by the crowd. Intel and AMD will have their co-processors. As all the major players are mucking in, much of it will come down to supporting standard toolkits, such as TensorFlow, and then we will not have to care too much about the specifics, just the benchmarks.

From the smaller players, as much as I like and am cheering for the Adapteva approach I think their memory architecture may not be well suited to DNN. I hope I'm wrong.

Wave Computing is probably my favourite approach after FPGAs. Their whole asynchronous data flow approach is quite awesome. It appears REM is doing something similar, but I think they may be too late. Will Wave Computing be able to hold their head up in the face of all the opposition? Perhaps, as their asynchronous CGRA has an inherent advantage. Though I'm not sure they need just DNNs to succeed as their tech has much broader applicability.

Neuromorphic spiking processor thing-a-ma-bobs are probably worth ignoring for now but keep your eye on them due to their power advantage. Quantum crunching may make it all moot anyway. The exception to this rule is probably IBM's TrueNorth thanks to its ability to not just do spiking networks but to also run DNNs efficiently.

For me, Random Forests are friendly. They are much harder to screw up ;-)

Happy trading,


Wednesday, 28 June 2017

U.S. Equity Market Structure Part I: A Review of the Evolution of Today’s Equity Market Structure and How We Got Here

If you have three hours and forty-six minutes of time to kill then you should probably not watch the Committee on Financial Services testimony from earlier today, Huonville time:

This is the Committee web reference reproduced:

Hearing entitled “U.S. Equity Market Structure Part I: A Review of the Evolution of Today’s Equity Market Structure and How We Got Here” 
Tuesday, June 27, 2017 10:00 AM in 2128 Rayburn HOB 
Capital Markets, Securities, and Investment

I think Mr Larry Tabb summed up the prospects for reform nicely in one of his recent Market Structure Weekly video pieces,
"Tabb dissects the debate over US equities market structure and Reg NMS, and the difficulties in reaching a consensus."
That is, there is unlikely to be any consensus anytime soon. However, some rays of hope did appear. I interpreted there to be general support for:
  • getting more companies public;
  • tick size variation;
  • depth of market being added to SIP and further SIP improvements; and,
  • support for better disclosure on market performance and routeing.
Otherwise, most of the committee testimony pointed to differences of opinion. Even though most parties suggested a thorough review should take place, Mr Joe Saluzzi suggested this should not happen and only certain aspects should be reviewed. Mr Saluzzi spoke well but dropped a little clanger when he misled the committee and told them that SIP feeds could be used for pricing PFOF when that has been against regulation since Nov 2015.

The only truly bad behaviour was from Mr Brad Katsuyama. His referral to rebates as kickbacks and talk of syphoning off $2.5B in kickbacks as part of a corrupt system was at best an inconsiderate use of language and at worst libel. I've discussed this previously here:
Near the end, Mr Chris Concannon showed some backbone and started to dig into Mr Katsuyama's falsehoods with a muted degree of fury but the time pocketed format didn't really allow much debate.

It is quite impressive the delusion IEX continues to suffer from. They really don't understand the harm they are doing to the market and their own customers. I've more than covered that to death in the past and it is getting tired, so here are the highlights from older meanderings:
Mr David Weisberger wrote a more pointed criticism of Mr Katsuyama's testimony, "ViableMkts ANNOTATION of the Testimony of Investors Exchange Chief Executive Officer Bradley Katsuyama."

You have to give credit to Mr Katsuyama, he really believes he is doing the right thing. He doesn't understand the harm he is doing:
  • Lack of price discovery via a preponderance of dark liquidity;
  • Speed-bump flaws that expose client orders to others before the clients may receive their own notifications, exposing those clients to latency arbitrage in a way that is worse than all other exchanges;
  • Expensive transaction costs for the majority of their orders;
  • Complex order types instead of the "three only" simple order types they originally made the case for in "Flash Boys";
  • Unfairness of a lack of co-lo where traders can game POP access and get latency advantages;
  • The need for sophisticated infrastructure including multiple sites with RF or laser required to maximise information and minimise leakage from IEX;
  • A large degree of false positives from a poor one size fits all Crumbling Quote Indicator (CQI) that will lose priority too easily;
  • The CQI preventing the ability to trade in a moving market - increasing risk;
  • Excessive potential to miss fills and let the market move away;
  • Preventing innovation with the wrong kind of flawed innovation;
  • Misleading market statistics due to their dark reliance and lack of trade on market ticks;
  • Poor displayed liquidity, with only CHX having shorter queues, showing the difficulty with, and fragility of, IEX's displayed market; and,
  • MM-Peg latency issues, despite it being a post only order that is not expected to trade.
IEX has real problems, but not that you would know from their marketing.

On the consensus points, Mr John Comerford from Instinet chose to focus on the problem with the one size fits all tick size. He pointed out that the current tick size was only really appropriate for a third of the market. Tick sizes for other stocks were either too big or too small. This was a great focal point and one that didn't see much disagreement. Mr Tom Wittman supported the idea of "intelligent tick sizes" that Nasdaq had also raised at the last EMSAC. This is an obvious thing that needs to be done.
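To see why one tick size cannot suit every stock, it helps to express the tick relative to the share price. A minimal sketch, assuming the standard one cent minimum increment for stocks priced at or above $1; the example prices are hypothetical and chosen only to span a range:

```python
# One-cent minimum pricing increment (applies to stocks priced at or above $1).
TICK = 0.01


def tick_in_bps(price: float, tick: float = TICK) -> float:
    """Size of one tick relative to the share price, in basis points."""
    return tick / price * 10_000


# HYPOTHETICAL example prices, not taken from the testimony.
for price in (2.0, 20.0, 200.0, 1000.0):
    print(f"${price:>8.2f} stock: one tick = {tick_in_bps(price):.2f} bps")
```

The same cent is a coarse increment on a low-priced stock and an almost negligible one on a high-priced stock, which is the one size fits all problem in a nutshell.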

PFOF was politely contentious. Without tick size adjustment there is no real way that public exchanges can compete against dark sub-penny increments, including PFOF. Retail would be worse off if PFOF were simply eliminated, see:
Sub-pennies rule!
Another important point that seemed to engender consensus was the need for better information and analysis around routeing and trade reporting. That would be a good thing to move forward as too much remains in the dark or is too onerous to analyse.

I was a bit surprised by the olive branch that seemed to be held out by the exchanges on the SIP. There seemed to be no opposition to adding depth to the SIP. That would be an advance. Perhaps it is a deferment to try to take the heat off their ongoing market data fee argument?

One exchange was a bit misleading with the idea that direct feeds from exchanges were subject to competition. That claim was a bit cheeky. Mr Saluzzi quite correctly disputed that idea. There remains much consternation around market data costs and fees. The exchanges will stoutly defend this territory.

Mr Ari Rubenstein from GTS made a host of decent points. The one that showed an unfortunate bias was the claim that the BATS closing auction would harm the market. That is a hard proposition to support. Mr Rubenstein's position as a large DMM at NYSE is an obvious conflict.

Mr Jeff Brown spoke well as a representative of Schwab but lost his way for a moment in the defence of PFOF. I'm not fond of PFOF but do accept that it delivers unassailable benefits to retail thanks to the sub-penny rule despite intermediation of best execution responsibilities. This should be better articulated. Mr Comerford's tick adjustments will be the way public exchanges assail that fortress for the public good, in time.

Mr Thomas Farley lost his way on listing standards which was understandable but otherwise handled most questions deftly. He raised an excellent point about not enjoying getting the blame for SEC deferment for proposals to committees. The exchanges expressed a preference for the SEC to do the work so the exchanges don't get the blame for unpopular decisions. It seems both the SEC and the exchanges would prefer the tenure that comes from being able to blame someone else. This perhaps needs a rebalance. Exchange SRO responsibilities were also contentious.

I disagreed with Mr Matt Lyons from The Capital Group on rebates but he put his case well. Mr Saluzzi agreed. Mr Katsuyama undermined this argument with his hysterical and harmful approach to demonising rebates with his silly kickback diatribe. Others made the strong case for the need for rebates for liquidity, especially for small to medium stocks.

There was a good conversation around the lack of companies going public. This certainly needs more attention but the difficulties in preventing the private market from gazumping the public market should not be underestimated. The "why bother" question is not easy. Forcing companies to be public is not realistic. Now that particular genie is out of the bottle, getting rid of impediments may help but it may be too late,
"the new higher level was not reduced when the fine was removed" [p 15]
The CAT received both support and disdain. I'm in a camp that says it is needed but I'm slightly horrified by the "Crazy CAT" implementation NMS has been lumbered with:
Crazy CAT approved by SEC.
Exchange resistance to market data feed expense mitigation in the face of overwhelming opposition looks like fair picking for regulatory reform. PFOF should be a thing that goes away but it needs to wait for the public market's ability to perform just as well, which for now it cannot.

It is going to be difficult to progress, especially with evident support for the alternate facts in "Flash Boys" from some of the committee questioning. Disinformation has a long shadow.

Where there is civil debate, there is hope.

Happy trading,


Monday, 26 June 2017

Finra ATS Tier 1 statistical update

As a few things are afoot, it may be handy to get our heads around the current anatomy of the US ATS market. Let's meander through this dark corner.

We'll just look at the statistics for tier one stocks as these are the most timely reports.

There is no change to the relatively stable rankings of the top three pools. UBS's ATS and CS's Crossfinder remain way out in front. DB had a poor week: its position at #3 is in the greatest peril it has seen for some time, with both JP Morgan and Barclays the closest they have been to DB for some months.

Goldman Sachs's transition to their new ATS has largely been completed with their newer platform rising ten places to #11 this week. KCG dropped three places to #14. LiquidNet H2O gained 4 spots. NYFX Cowen Exec Services dropped 6 places to #21.

In ATS news this week it was announced that Instinet is purchasing State Street's ATS. You can see Instinet's current pool is ranked tenth with 105M shares traded, with State Street ranked twentieth. If they were combined, which is not being suggested yet, they would rank #9. The big difference between the two is that State Street's pool has an average trade size of 12,482 shares as compared to 229 for Instinet's current CBX pool.

DealerWeb (360,125) and LiquidNet (40,853) lead the average trade block sizes.

Luminex's paltry 5.3M shares traded, and fifth last ranking, clearly demonstrates that markets require diversity. Markets work despite the motivations of the participants. That's their ultimate beauty. Diversity matters and homogeneity risks growth. Luminex only managed 162 trades for the week. I'm not sure you need technology beyond a notebook and pen for that. At least the average block size at 32,879 was high, being the third largest. This emphasises that liquidity is a carefully orchestrated dance of mutual benefits. A dance of offer, parry, hedge, replenish. Quite the tango that is oft misunderstood as war rather than for being the carefully calibrated artistry that it truly is.
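The quoted figures hang together, since average trade size is just volume divided by trade count. A quick sketch of that check, using only numbers quoted in this post (the helper names are mine):

```python
def implied_volume(trades: int, avg_trade_size: int) -> int:
    """Shares traded implied by a trade count and an average trade size."""
    return trades * avg_trade_size


def implied_trades(volume: float, avg_trade_size: float) -> float:
    """Trade count implied by shares traded and an average trade size."""
    return volume / avg_trade_size


# Luminex: 162 trades at an average of 32,879 shares.
luminex_volume = implied_volume(162, 32_879)
print(f"Luminex implied volume: {luminex_volume / 1e6:.1f} M shares")

# Instinet CBX: 105M shares at an average of 229 shares per trade.
instinet_trades = implied_trades(105e6, 229)
print(f"Instinet CBX implied trades: about {instinet_trades:,.0f}")
```

Luminex's 162 trades reproduce its 5.3M shares, whilst Instinet's pool needs several hundred thousand trades to move 105M shares. Different beasts indeed.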

Rank  MPID  ATS name                     T1 share %  Volume (shares)  Avg trade size
1     UBSA  UBS ATS                           17.61          486.2 M             172
2     CROS  CROSSFINDER                       13.97          385.6 M             189
3     DBAX  SUPERX                             7.14          197.2 M             195
4     MSPL  MS POOL (ATS-4)                    6.64          183.4 M             260
6     LATS  BARCLAYS ATS ("LX")                5.90          163.0 M             214
7     EBXL  LEVEL ATS                          5.59          154.4 M             208
8     MLIX  INSTINCT X                         5.01          138.2 M             228
9     BIDS  BIDS TRADING                       4.47          123.5 M             788
11    SGMT  GOLDMAN SACHS & CO. LLC            3.48           96.1 M             203
12    ITGP  POSIT                              3.47           95.9 M             308
13    KCGM  KCG MATCHIT                        3.31           91.3 M             184
14    MSTX  MS TRAJECTORY CROSS (ATS-1)        2.15           59.3 M             177
15    XSTM  CROSSSTREAM                        1.58           43.6 M             391
16    DLTA  DEALERWEB                          1.45           40.0 M         360,125
18    CXCX  CITI CROSS                         1.13           31.2 M             230
19    LQNA  LIQUIDNET H2O                      0.92           25.5 M          17,565
22    LQNT  LIQUIDNET ATS                      0.86           23.7 M          40,853
23    XIST  INSTINET CROSSING                  0.64           17.8 M           5,196
24    PDQX  CODA MARKETS, INC.                 0.50           13.8 M             230
25    CBLC  CITIBLOC                           0.31            8.6 M          19,651
26    MSRP  MS RETAIL POOL (ATS-6)             0.26            7.0 M             186
28    WDNX  XE                                 0.05            1.3 M           1,636
29    AQUA  AQUA                               0.02            0.6 M           6,488
30    BCDX  BARCLAYS DIRECTEX                  0.01            0.2 M          29,471


The top 5 pools represent over half the ATS volume traded. The top ten pools' collective share has been steadily rising to its current level of three quarters of all ATS volume. This was assisted by IEX's dark pool transitioning to being the SEC's first dark public exchange, which corresponds to the short period of the largest rise.
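The "over half" figure can be bounded from the rows listed in the table, even though the rank 5 row does not appear there. A rough sketch, assuming only that a pool ranked fifth must sit between ranks 4 and 6 in share; the 105M figure for Instinet's pool is the one quoted earlier in this post:

```python
# (rank, share %, volume in millions of shares) for rows listed in the table.
ROWS = [
    (1, 17.61, 486.2),
    (2, 13.97, 385.6),
    (3, 7.14, 197.2),
    (4, 6.64, 183.4),
    (6, 5.90, 163.0),
]

top4_share = sum(share for rank, share, _ in ROWS if rank <= 4)
rank4_share = ROWS[3][1]
rank6_share = ROWS[4][1]

# ASSUMPTION: the unlisted rank 5 share lies between ranks 6 and 4.
top5_low = top4_share + rank6_share
top5_high = top4_share + rank4_share
print(f"top 5 share is between {top5_low:.2f}% and {top5_high:.2f}%")

# Each row also implies a total tier 1 ATS volume: volume / share.
implied_totals = [volume / (share / 100) for _, share, volume in ROWS]
total = sum(implied_totals) / len(implied_totals)
print(f"implied total volume: about {total:,.0f} M shares")

# Share implied for Instinet's 105M-share pool, which is ranked tenth.
instinet_share = 105 / total * 100
print(f"implied Instinet share: about {instinet_share:.1f}%")
```

The implied Instinet share sits between the rank 9 and rank 11 rows of the table, as it should.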


The average trade size of the top 15 pools mainly resides in the minimal 100-300 shares per trade range with only XSTM CrossStream and BIDS being the consistent larger exceptions. The largest pool, UBS, typically has the smallest average trade size as you may see in the following chart. You may note the strange red line in the bottom right of the chart representing the new Goldman Sachs platform leaping into life.


That previous chart makes it a bit hard to see if any of the top pools, apart from BIDS, have increased their average trade size. An alternative view of the top ten pools below shows their average trade size for the week compared to their average trade size over time, to make it easier to see variations in size compared to their own normal:


Well, the size variation was meant to be somewhat easier to understand in that chart, for some strange definition of easier.

There do seem to be too many pools and exchanges. I can't help but wonder if there shouldn't be tighter policing of the proliferation by treating the NMS space more like radio spectrum and considering the venue space as a scarce resource. The bad old days of NYSE dominance showed one exchange to rule them all was not the best idea, but surely the US does not need more than forty exchanges and ATS pools.

I also remain of the belief that the SEC should carefully consider the two types of pools we see in this ATS mix. There is quite a different utility to a large block trading pool and a pool with a small average trade size. They are different beasts. Perhaps the SEC needs to explicitly partition their rule space for such species.

I'm not sure a small average trade sized pool with lots of volume should exist for many years if it is not a public exchange. I'm biased against such parasitic pools due to their lack of participation in price discovery. Parasitic pools, like index funds, may have some utility but it should be clearly articulated what their efficiency or utility really is. It is not always clear what such low average trade size pools offer apart from being an embryonic step to being a public exchange. If there are some benefits gained by the low trade sized ATS pools due to easier rule enforcement then perhaps the rules for exchanges should be changed to allow the same efficiencies. If such rules aren't suitable for a public exchange, then perhaps they have no place for an ATS either.

Perhaps time limited ATS licenses should be granted for the low average trade sized ATS? Go big, or go home. Be an exchange in five years or stop clogging up NMS plumbing. All systems need a cleanse from time to time.

Happy trading,


OTC Transparency data is provided via and is copyrighted by FINRA 2017