Merkle Trees - How to Pull Hashes of Leaves


#1

Hi all and thanks for taking the time to read my questions,

I’ve been learning about Merkle Trees and I think I have a pretty good idea of the foundations of a binary Merkle tree at least, but I just can’t wrap my head around one part. Since only the root hash is recorded on the block, yet validation of any transaction requires a proof containing hashes from the branches that led it to the root, how do we go about getting the hashes for the leaves of the root and the leaves that proceed them? Where is the data structure stored, and how do we know who has the data to provide it in entirety? It makes sense that transaction x can plausibly know its own index within the tree, but how are we to know what path to take up the tree without knowing the entire tree?

Thanks again for taking the time to read, hopefully this isn’t a dreadfully stupid question!
gamedazed


#2

You are asking all the right questions!

To review, a binary Merkle tree is structured as follows: for some set of data S with distinct elements s_i, you put each s_i as one leaf node and then pad the sequence until it is even. You may see some differences between implementations here; some will use the hash of s_i in place of s_i and there are a variety of ways to pad the sequence until it is even. (aside: there was a subtle attack on Bitcoin’s Merkle tree construction that had to do with how they padded the tree which could in theory lead to some transaction trickery!). Now with our constructed sequence of pairs, we proceed recursively: for each pair, we create a parent node that points to each child in the pair and has a value that is equal to the hash of the concatenation of each child’s value. We do this until we end up with a single root node that is in some sense a “fingerprint” for the entire tree as changing any node below the root will change the rest of the tree along that path from the node to the root. This property enables us to create Merkle proofs where we can convince someone with just the root hash of the Merkle tree that a particular element s_i is in the tree. We do this by providing the element s_i and all the hashes in the tree required to build the path from the element to the root node. For a O(log n) number of elements (meaning proofs are compact), a verifier can use this proof to verify the claim that s_i is indeed in the set S by confirming all the hashes actually build the tree claimed in the proof and that the resulting root matches the root they know to be correct.

Simply put, you have to ask someone who has them. One setting where Merkle trees are useful is in light client protocols. A light client is one who doesn’t keep the full blockchain themselves, but a subset of the state. A light client still wants some security guarantees even though they can’t validate the full blockchain themselves and one way they can do this is by using the Merkle proofs just described. By maintaining recent block headers, a light client can query a full node (or ideally a randomized subset of the network) for Merkle proofs of the data they are interested in obtaining. The light client can then verify the Merkle proofs to know if the data they get back is from an honest full node or not.

Similar to the first question, the data is stored with whoever cares enough to maintain it. Some nodes do this so they have a chance at pecuniary rewards (i.e. proof-of-work mining) and some nodes do it so they can be self-sovereign (i.e. they can validate the blockchain’s data without relying on a third-party). The bottom line is there are a variety of incentives at play on a blockchain and different nodes do different things for different reasons. The blockchain works if, enough of the time, enough of the data is available that anyone who needs some state can get to it. In general, we don’t know who was the data, esp. in its entirety. For the most part, we can safely assume we can connect to existing full nodes and get the data we need. There are various mechanisms in each protocol to allow for a new node to bootstrap and find existing nodes (e.g. there are Bitcoin seed nodes you can talk to in order to bootstrap your knowledge of the network). It turns out that the full scope of this problem is actually a research question with unknown solution. You can start to see how this becomes problematic in a sharded blockchain where nodes may need data from an earlier shard that current nodes have no reason to keep track of. You can read more here: https://github.com/ethereum/research/wiki/A-note-on-data-availability-and-erasure-coding


#3

Thank you very much! That makes perfect sense; in essence we create an ecosystem that ensures parties have an interest in providing the full Merkle Tree.

I’ve seen the presumption of knowing the block we’re checking multiple times already, that makes more sense now, and we’re essentially looking for the hash of our transaction at one of the leaves. If we don’t know what our block is, we can download blocks until we see a tree containing the hash of transaction t, and by downloading trees and storing them we can earn some reward by offering them when requested. Is it correct to assume that the query is made with the block as an argument?


#4

As a thought to the link you posted:

In sharding we have block validators and block generators(I don’t remember the exact terminology), wouldn’t it be reasonable to have the block generators pass the Merkle trees for their blocks they generate to the validator node, the validators and generators each create hashes of the full tree based on the transactions they passed. If that were the case, wouldn’t you be able to prove
A) that each validator provided an unchanged set of transactions
B) that any invalid proofs submitted would be necessarily linked to the validator creating the invalid tree

I’m not expecting that I’m the genius who figures out the fix to this problem, merely a journeying lover of learning seeking the errors in his ways. Please guide me, oh wise ones.


#5

This is kind of an implementation detail or a concern of a particular implementation.

And to clarify the idea in more exotic settings is to have some incentive for nodes to host data. But in most blockchains there is no in-protocol incentive to store the history (beyond what you need to mine). This is in part due to the fact that this problem is actually really hard to solve well.


#6

This may be outdated but some recent language is a collator and a proposer.

And to your suggestion, a few things:

First, the root hash of a Merkle tree is the identifier for the full tree, for the points we looked at above. So you can just send the root hash to a peer if you just want to commit to a certain tree, you don’t need to hash the tree somehow as the Merkleization already does that effectively

Next, as far as I understand what you are suggesting that is more or less already part of the existing block validation process. Let’s review: in Ethereum today a block is a list of transactions and a block header. The block header contains several pieces of metadata containing three Merkle tree roots: the root for the transaction Merkle tree, the root for the receipts list and the root for the state tree. The transaction root represents the transactions in the block. The receipts root represents the resulting receipts (result of running a particular transaction against the prior state). The state root represents the global state tree after applying the transactions in the block. These each serve as “commitments” in the sense that if another node derives conflicting data they consider the entire block invalid. They moreover facilitate light client protocols such that you can just track block headers, rather than full blocks, and just pull down the data you need from your peers which are verified by Merkle proofs.

You have to add some additional constructs so that this works in a sharded context where every node by definition has an incomplete view of the network. The exact set of constructs is an outstanding research problem as we build Ethereum 2.0.