You are asking all the right questions!

To review, a binary Merkle tree is structured as follows: for some set of data S with distinct elements `s_i`

, you put each `s_i`

as one leaf node and then pad the sequence until it is even. You may see some differences between implementations here; some will use the hash of `s_i`

in place of `s_i`

and there are a variety of ways to pad the sequence until it is even. (aside: there was a subtle attack on Bitcoin’s Merkle tree construction that had to do with how they padded the tree which could in theory lead to some transaction trickery!). Now with our constructed sequence of pairs, we proceed recursively: for each pair, we create a parent node that points to each child in the pair and has a value that is equal to the hash of the concatenation of each child’s value. We do this until we end up with a single root node that is in some sense a “fingerprint” for the entire tree as changing any node below the root will change the rest of the tree along that path from the node to the root. This property enables us to create Merkle proofs where we can convince someone with just the root hash of the Merkle tree that a particular element `s_i`

is in the tree. We do this by providing the element `s_i`

and all the hashes in the tree required to build the path from the element to the root node. For a O(log n) number of elements (meaning proofs are compact), a verifier can use this proof to verify the claim that `s_i`

is indeed in the set S by confirming all the hashes actually build the tree claimed in the proof and that the resulting root matches the root they know to be correct.

Simply put, you have to ask someone who has them. One setting where Merkle trees are useful is in light client protocols. A light client is one who doesn’t keep the full blockchain themselves, but a subset of the state. A light client still wants some security guarantees even though they can’t validate the full blockchain themselves and one way they can do this is by using the Merkle proofs just described. By maintaining recent block headers, a light client can query a full node (or ideally a randomized subset of the network) for Merkle proofs of the data they are interested in obtaining. The light client can then verify the Merkle proofs to know if the data they get back is from an honest full node or not.

Similar to the first question, the data is stored with whoever cares enough to maintain it. Some nodes do this so they have a chance at pecuniary rewards (i.e. proof-of-work mining) and some nodes do it so they can be self-sovereign (i.e. they can validate the blockchain’s data without relying on a third-party). The bottom line is there are a variety of incentives at play on a blockchain and different nodes do different things for different reasons. The blockchain works if, enough of the time, enough of the data is available that anyone who needs some state can get to it. In general, we don’t know who was the data, esp. in its entirety. For the most part, we can safely assume we can connect to existing full nodes and get the data we need. There are various mechanisms in each protocol to allow for a new node to bootstrap and find existing nodes (e.g. there are Bitcoin seed nodes you can talk to in order to bootstrap your knowledge of the network). It turns out that the full scope of this problem is actually a research question with unknown solution. You can start to see how this becomes problematic in a sharded blockchain where nodes may need data from an earlier shard that current nodes have no reason to keep track of. You can read more here: https://github.com/ethereum/research/wiki/A-note-on-data-availability-and-erasure-coding