I’ve had a very deep technical discussion with the Elrond guys concerning their consensus, the state of their code etc. We didn’t see much of the code but we got a feel for what they are doing now and how it differs from the whitepaper and the prototype they had in August.
First of all, what is Elrond: it’s a public, proof-of-stake based, homogeneous state sharded blockchain with EVM for smart contracts. They are using pBFT for consensus with a twist: not all shard nodes participate in a consensus round but just a randomly selected subset, the so-called «consensus group», which leads to a faster consensus but has security implications that need to be addressed. Let’s unpack how exactly they are addressed in Elrond.
What is a state sharded blockchain? The term comes from conventional distributed systems engineering and means that the total database state (in this case all the account balances and smart contract data) is divided into chunks, and a single node only has to store one chunk. So if the total blockchain size is, say, 100 TB, then in a state sharded model with 1000 shards any single node only has to store about 100 GB. Transaction processing can be sharded too: a global 10 000 TPS becomes 10 TPS for any single node in that example.
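To make the arithmetic above concrete, here is a tiny Python sketch. It assumes perfectly even sharding and decimal units (1 TB = 1000 GB); the function name `per_node_load` is ours, purely for illustration.

```python
def per_node_load(total_state_tb, total_tps, shards):
    """Return (state per node in GB, TPS per node), assuming perfectly even sharding."""
    state_gb = total_state_tb * 1000 / shards  # decimal units: 1 TB = 1000 GB
    tps = total_tps / shards
    return state_gb, tps


# The example from the text: 100 TB of state, 10 000 TPS, 1000 shards.
state_gb, tps = per_node_load(100, 10_000, 1000)
print(f"{state_gb:.0f} GB and {tps:.0f} TPS per node")
```

Real shards are never perfectly balanced, so treat these as lower bounds on what a single node has to handle.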
Sharding is a very common technique in conventional databases, where it is much simpler to implement. The reason is that databases usually don’t presume there are malicious nodes. Nodes are presumed to be well-behaved and only fail due to external circumstances like a power outage or network problems. When any given node can actively try to corrupt data for nefarious purposes, safe sharding gets a lot more difficult to engineer.
There are a lot of projects that aim to bring sharding to blockchains as an L1 scaling solution: eth2, Zilliqa, NEAR Protocol and QuarkChain, to name a few. Most are homogeneous: that means that individual shards all share the same consensus, virtual machine, node software etc. Elrond is doing that too.
The main security implication for a sharded blockchain is that a single corrupted shard can lead to double spends or currency forging that affect the whole blockchain, and individual shards are less secure than a monolithic blockchain of the same node count. The main UX implication is that individual shards are essentially data islands: interaction between accounts in different shards is slower, more expensive and less secure. Sharded eth2 is not like Ethereum but 256 times bigger. It’s more like a construction of 256 Ethereum chains bolted together correctly.
Homogeneous state sharded blockchains in development ultimately share the same general shape. A blockchain with a total of N nodes is divided into K shards. There is some random beacon that distributes nodes to shards, and nodes periodically rotate. When assigned to a shard, nodes use some consensus algorithm to advance that shard’s blockchain. There’s some method for intershard transactions. There’s some fallback for the case when an individual shard goes rogue. What differs are the exact protocols for random number generation, the consensus algorithms and the intershard transaction mechanics. Let’s dig into Elrond.
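The "beacon distributes N nodes to K shards" step can be sketched roughly like this. To be clear, this is not Elrond’s actual shuffling algorithm, just a minimal illustration under our own assumptions: the beacon output is an opaque byte string, nodes are identified by strings, and the ordering is derived from SHA-256. The name `assign_shards` is hypothetical.

```python
import hashlib


def assign_shards(node_ids, beacon, num_shards):
    """Deterministically shuffle nodes with the beacon value and deal them into shards."""
    # Sort by a hash that mixes the beacon with each node id. Every honest node
    # that knows the beacon computes the same order with no extra communication.
    order = sorted(
        node_ids,
        key=lambda n: hashlib.sha256(beacon + n.encode()).digest(),
    )
    shards = [[] for _ in range(num_shards)]
    for i, node in enumerate(order):
        shards[i % num_shards].append(node)  # round-robin keeps shard sizes balanced
    return shards


nodes = [f"node{i}" for i in range(10)]
for shard_id, members in enumerate(assign_shards(nodes, b"epoch-1", 3)):
    print(shard_id, members)
```

Rotation then amounts to rerunning the assignment with a fresh beacon value. The whole thing is only as good as the beacon: if an attacker can predict or bias it, they can steer their nodes into one shard.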
The most notable part is Elrond’s consensus algorithm, which doubles as a random beacon. The original idea was for a block proposer to select a number of nodes from the consensus group who would combine their signatures in a Schnorr multisig that would serve as both a block signature and a random beacon. The random beacon would then be used to select the next round’s block proposer and consensus group. The described beacon would satisfy the unpredictability and unbiasability properties needed for consensus security but would seriously compromise liveness. A single malicious node in a signing group (a 2/3+ subset of the consensus group) would force block generation to fail and restart.
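To get a feel for how bad the liveness problem is, here is a back-of-the-envelope sketch. It assumes each signer is malicious independently with the same probability, which is our simplification, and the group size of 21 is a made-up number, not an Elrond parameter.

```python
def p_round_fails(malicious_fraction, signing_group_size):
    """P(at least one malicious node in the signing group), assuming independent draws."""
    return 1 - (1 - malicious_fraction) ** signing_group_size


# Hypothetical signing group of 21 nodes at several adversary fractions.
for f in (0.01, 0.05, 0.10):
    print(f"malicious fraction {f:.2f}: round fails with p = {p_round_fails(f, 21):.3f}")
```

Even a small malicious fraction makes failed rounds common once the signing group has a couple dozen members, which is why a scheme where any single signer can abort the round is so fragile.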
Since then and since our call, Elrond has changed the signature scheme to BLS multisig and the random beacon from a Schnorr multisig to a single BLS signature by a designated block producer. That scheme looks a lot better: the liveness implications are gone and the random beacon properties are only slightly worse (a given block producer can opt to fail to sign, passing the torch to the next one in line). That makes the beacon slightly biasable (bad guys can get a few retries), which could be a deal-breaker for applications like games of chance or lotteries but is more or less acceptable for cryptographic sortition with the right parameters. It doesn’t really matter if the chances of shard capture by an attacker are 1e-18 or 2e-18.
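The "1e-18 or 2e-18" point can be sketched numerically. The model below is ours, not Elrond’s: the consensus group is sampled so that each member is malicious independently (a binomial approximation for a large shard), capture means reaching a given threshold of malicious members, and each retry the attacker gets is an independent redraw. All concrete numbers are hypothetical.

```python
import math


def capture_probability(group_size, malicious_fraction, threshold):
    """P(at least `threshold` of `group_size` sampled validators are malicious)."""
    f = malicious_fraction
    return sum(
        math.comb(group_size, k) * f**k * (1 - f) ** (group_size - k)
        for k in range(threshold, group_size + 1)
    )


def with_retries(p, attempts):
    """P(at least one of `attempts` independent draws succeeds); requires p < 1.

    Uses log1p/expm1 so the result stays accurate for tiny p, where the naive
    1 - (1 - p) ** attempts would round to zero in floating point.
    """
    return -math.expm1(attempts * math.log1p(-p))


# Hypothetical parameters: group of 60, 25% malicious stake, 41 needed for capture.
p = capture_probability(60, 0.25, 41)
print(f"single draw: {p:.3e}")
print(f"with one retry: {with_retries(p, 2):.3e}")
```

For tiny probabilities a handful of retries just multiplies the capture chance by roughly the number of attempts, so a well-parameterized sortition stays safe. A lottery, where the attacker only needs a favorable outcome rather than an astronomically unlikely one, is a different story.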
Elrond also proposed to add some local rating system into the mix so that a block proposer would only choose nodes he knows are good to sign the message. That doesn’t sound like a good idea. Reputation-based systems are either too weak to protect from a malicious adversary or strong enough for the adversary to use them for griefing attacks. Consensus protocols that assume an adaptive adversary (i.e. identity and reputation theft among validators) are much stronger. Thankfully, the latest protocol version doesn’t need a reputation system to work, and it’s deprecated now.
For intershard communication Elrond now plans to use more or less the same scheme eth2 is using: a source shard commits an intention to do a cross-shard transaction, a destination shard waits until it gets anchored on the «main» chain of Elrond, the so-called metachain (that also shuffles validators among shards), and commits a successful landing in the destination shard. That’s a robust scheme that doesn’t depend on metachain throughput too much. At the time of the call it lacked any fallback in case of a malicious shard that starts to double spend or mint fresh tokens out of thin air. The Elrond guys think it’s not that likely, but if the only answer to one shard breaking for basically two consensus rounds is a hardfork, that sure sounds fragile. Elrond engineers have come up with two ideas since then. One is using ZK rollups, which we don’t really find suitable (not only does it work only on simple monetary transfers as opposed to Elrond’s full intershard smart contracting, it also doesn’t have great latency due to the pretty hard computations involved). The second is using an economic game to amend double spends and similar bad actions by slashing malicious validators, and that is a much more promising approach. For a fully functional game, the economic effect of intershard smart contract transactions must be limited to the consensus group’s collective stake, and there should be a truebit-like dispute mechanism to challenge/amend bad blocks.
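The commit, anchor, land flow described above can be sketched as a toy model. Everything here is an illustrative assumption rather than Elrond’s data model: shards are plain balance maps, the metachain is reduced to a set of anchored block hashes, and the names `Shard`, `commit_on_source` and `commit_on_destination` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Shard:
    balances: dict
    processed: set = field(default_factory=set)  # receipts already applied here


def commit_on_source(shard, sender, receiver, amount, nonce):
    """Debit the sender in the source shard and emit a receipt for the destination."""
    if shard.balances.get(sender, 0) < amount:
        raise ValueError("insufficient funds")
    shard.balances[sender] -= amount
    payload = f"{sender}->{receiver}:{amount}:{nonce}"
    # Stand-in for the hash of the source shard block that contains the transaction.
    block_hash = hashlib.sha256(payload.encode()).hexdigest()
    return {"receiver": receiver, "amount": amount, "block_hash": block_hash}


def commit_on_destination(shard, anchored, receipt):
    """Credit the receiver only if the source block is anchored and not yet applied."""
    h = receipt["block_hash"]
    if h not in anchored or h in shard.processed:
        return False
    receiver = receipt["receiver"]
    shard.balances[receiver] = shard.balances.get(receiver, 0) + receipt["amount"]
    shard.processed.add(h)
    return True


source, dest = Shard({"alice": 10}), Shard({"bob": 0})
metachain_anchored = set()  # block hashes notarized by the metachain
receipt = commit_on_source(source, "alice", "bob", 4, nonce=0)
print(commit_on_destination(dest, metachain_anchored, receipt))  # not anchored yet
metachain_anchored.add(receipt["block_hash"])
print(commit_on_destination(dest, metachain_anchored, receipt), dest.balances)
```

Note what the toy model does not do: the destination never re-executes the source shard’s block, it only trusts that the block was anchored. That is exactly why a captured source shard can mint tokens out of thin air unless there is some fallback such as the dispute game.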
The same goes for their proposed adaptive sharding, at least as described in the whitepaper. The main idea was that when a shard gets too busy it’s automatically divided in two. That could probably be implemented safely enough with a simple payment blockchain but not for an Eth-like smart contract one. Contracts relying on intrashard speed and security assumptions could be wrecked by a sudden migration. Anything like Ethereum’s DeFi, or even a simple multisig, would become very complicated in a model where you can’t rely on an oracle/token/library contract being in the same chunk of state as the contracts that depend on it. Thankfully, since then Elrond has taken a more conservative approach: shards are only spawned when there are enough validators to support another one, and smart contracts only migrate in bulk: no static dependency is broken by the migration. There might be some issues with delegatecall-based smart contracts but those are a lot easier to isolate and harder to exploit.
Two more things I’d like to touch upon are using a separate local random beacon for each shard instead of a global one in the metachain, and using consensus groups vs. just using smaller shards. We think that using the metachain beacon for all shards would be more secure, though it has its drawbacks (e.g. the liveness of all shards would depend on metachain liveness).
We haven’t seen a lot of code, just a few pages on a shared screen, so we can’t comment on its readiness or quality, but we can tell that the current version of Elrond is written in Go, whereas the prototype was in Java. We were shown a demo that was blazingly fast (about 10k TPS) but I’m not sure how real those TPS are, as in: what a transaction includes (the team assured us signature checking is included at least), what the network latency is (test nodes were geographically distributed between different AWS datacenters, which is kinda realistic for public blockchains :) ), what the performance would be when there is enough blockchain state to not fit in the processor’s L2 cache, etc.