The march 2011 ponder this puzzle was one of my favorites:

```A spaceship has 100,000 bits of storage, and one of these bits is known to be faulty. You can locate the faulty bit using agents that run on any given subset of bits and return "OK" if all of the bits are good and die if they encounter the faulty bit. It takes an agent one hour to run a query, regardless of the size of the subset, but an infinite number of agents can run simultaneously. You need to find the wrong bit in two hours.
Since we must decide, in advance, how many agents to send with the spaceship, we are interested in the following questions:
A. What is the minimal number of agents needed? (Bonus question: Find a formula for the number of agents needed for n bits and t hours).
B. Suppose we want to send enough agents to be able to repeat the same task a second time with the remaining agents (i.e., those who did not die during the first invocation). How many agents are needed in that scenario?
Update March 2nd: Different agents can access the same memory bit at the same time.```

I’ve written up a handful of these ponder-this puzzles and continue to see value in doing so, not just for my own record, but also to help people get better at problem solving.  I do this knowing that the ponder-this people publish a solution (which I’ve appended to this post), because the expanded path that I walk through is more helpful to those who might have problems, and it shows what I think is an honest look at what goes into understanding these things.

I do Strongly suggest trying the puzzle for yourself before reading my solution.

Enough with the prologue, let’s start:

Without considering any sort of concurrency (two agents testing the same bit at the same time):

we could send out 99,999 agents, assign 1 per bit, and see which one doesn’t come back, if they all come back it’s the bit that wasn’t checked.

we could send out 2 agents, 50,000 bits for each for the first hour, and then depending which one comes back we send out 49,999 agents to test those 50,000 bits, 1 per bit.  A total of 50,000 agents would be needed.

we could send out 2 agents, 33,333 bits each, 33,334 untested, so that will (in the first hour) narrow down the problem to 33,333 or 33,334 bits, either way in the second hour we will need to add 33,331 agents, totaling 33,333 agents.

we could send out 3 agents, 25k bits each, 25k untested, so that will(in the first hour) narrow down the problem to 25,000 bits, and will require 24,999 bits to be sent out in hour two to find the bad bit.  Totaling 25,000 agents needed.

So the first hour we partition the bits into smaller sets, the set from the agent that dies in the first hour will be examined 1-bit per agent in the second hour to find the bit.

let’s call the number of agents sent out in the first hour a1, and the number of agents sent out in the second hour a2.

you can see that as we increase the number of agents sent out in the first hour, we decrease the number of agents required for the second hour.  We know that the number of partitions created in the first hour is a1+1 (since we can narrow it down to the last partition if they all come back alive), and the size of each partition is 100,000/(a1+1).  In the second hour, a2 must be one less than the size of the remaining partition (in order to test all but one in order to finger the faulty bit).

Without worrying about the details about agents dying and uneven partition sizes, lets try to minimize the total number of agents, and since at most 1 agent dies in the first hour, the total number of agents sent out is approximately the greater of a1 or a2.

we know a2  = 100,000/(a1+1) – 1

let’s approximate a2 as = 100,000/a1 since the two ones are nearly insignificant.

so we want to minimize   max(a1,a2) which is roughly max(a1,100,000/a1)

as we increase a1, a2 decreases, for the entire range, so the minimal maximum of the two will be when they are equal to each other.  a1=100,000/a1 -> a1^2 = 100,000 -> a1  = sqrt(100,000) ~= 316

so as a rough estimate the minimal number of agents we will need is around 316.

Sending out 316 agents the first hour, making 316 partitions of 316, and then in the 2nd hour sending out the 315 agents that returned alive after the first hour, 1 per bit for the final hour to find the bit in the partition.

This can be solved exactly, and the details can be worked out, but you will find the answer to be very near this, I’m not going to labor over it because we have not even considered concurrency yet… so lets do that:

Lets look at a basic example of concurrency:

We have 100,000 bits and lets say we send out 2 agents in the first hour, without concurrency we can partition the space into 3 equal sets (each 33,333/4 bits).  With concurrency we can tell agent1 to check bit 1-50,000 agent2 to check 25,001-75,000, so that if only agent1 dies we know the problem bit is in 1-25k, if they both die we know its 25,001-50k, if only agent2 dies its 50,001-75k, and if both survive it’s the untouched partition of 75,001-100k, so instead of partitioning into 3 areas, concurrency allows us to use 2 agents to partition the bits into 4 sets.  The cost is that both bits could die.  As we increase the overlap of agents testing a given bit, the more agents could die in the worst case.  Without any concurrency the worst case is 1 death per hour.

When the extra precision worth the extra loss of agents?

Let’s take a look at what a number of agents can do in 1 hour.

 bit1 bit2 bit3 bit4 bit5 bit6 bit7 bit8 agent1 check check check check agent2 check check check check agent3 check check check check

The above chart is an example where 3 agents can cover 8 bits completely.  Looking at who comes back alive, we can determine exactly what bit is faulty.  The downside is the worst case of 3 agent deaths.

here’s a table of what a number of agents can do in 1 hour:

 total # of agents min deaths (min concurrency) max deaths(max concurrency) size of set which the agents can determine the faulty bit.  (w/concurrency) (up to) bits covered per agent death (worst case) w/concurrency size of set which the agents can solve. w/o concurrency (max death = 1) bits covered per agent death (worst case) w/o concurrency 1 0 1 2 2 2 2 2 0 2 4 2 3 3 3 0 3 8 ~2.67 4 4 4 0 4 16 4 5 5 5 0 5 32 6.4 6 6 6 0 6 64 ~10.67 7 7 7 0 7 128 ~18.29 8 8 8 0 8 256 32 9 9 9 0 9 512 56.89 10 10 a 0 a 2^a (2^a)/a a+1 a+1

So even per death, (let alone per total agents required), concurrency is far better than non-concurrency for a reasonably large set of bits.

From this we can simply set the size of the set which agents can solve in an hour (2^a) equal to 100,000, and solve for a.  a needs to be 17 before 2^a is larger than 100,000… So 17 agents can find the faulty bit in 1 hour.

But we have 2 hours at our disposal, so maybe we can do better then 17.

Here again is the chart showing what 3 agents can do with concurrency, solving 8 bits in 1 round:

 bit1 bit2 bit3 bit4 bit5 bit6 bit7 bit8 agent1 check check check check agent2 check check check check agent3 check check check check

Now, instead of individual bits, imagine 8 mutually exclusive sets of bits:  (check for a set, means the agent checks every bit in the set)

 bit set 1 bit set 2 bit set 3 bit set 4 bit set 5 bit set 6 bit set 7 bit set 8 agent1 check check check check agent2 check check check check agent3 check check check check

So just as 3 agents could find the faulty bit of 8 bits, they can also isolate the set with the faulty bit out of 8 sets…  So the chart comparing bit coverage of concurrency verses non-concurrency is also relevant for partitions, not just individual bits, here it is again with only a few words changed:

 total # of agents min deaths (min concurrency) max deaths(max concurrency) # of sets/partitions that can be narrowed down to 1.(w/concurrency) (up to) sets covered per agent death (worst case) w/concurrency # of sets/partitions that can be narrowed down to 1. w/o concurrency (max death = 1) sets covered per agent death (worst case) w/o concurrency 1 0 1 2 2 2 2 2 0 2 4 2 3 3 3 0 3 8 ~2.67 4 4 4 0 4 16 4 5 5 5 0 5 32 6.4 6 6 6 0 6 64 ~10.67 7 7 7 0 7 128 ~18.29 8 8 8 0 8 256 32 9 9 9 0 9 512 56.89 10 10 a 0 a 2^a (2^a)/a a+1 a+1

So lets use the fact that we have two hours to split the problem into, it certainly helped a lot without using concurrency, lets see if it can help us with it.

Similar to earlier it seems natural that the best way to split the problem is to partition into 316 groups of about 316…  So we would need 9 agents for the first hour, to narrow it down to 1 set, and then another 9 agents for the second hour to narrow it down to 1 bit.  Total of 18 agents.

This is worse than what we can do in one hour, and looking at how this method works, lets try to exploit the most efficient set size for this method, which is 2^a, so lets try to partition into groups of 256 bits, there will be 391 groups. We know for the second hour we will need 8 agents to find the bit from 256, and we will need 9 agents in the first hour to select the partition out of the 391.  This totals 17 agents required, and is only as good as what we can do in 1 hour.

There are two things that come to mind when thinking about possible improvements.  One is if we send out 9 agents to partition the bits, then another 8 the 2nd hour, that means the 2nd group of 8 are simply waiting idle, for the first hour to end.  They could possibly be utilized without risking having 8 in the 2nd hour?  Also in the best case (least deaths), none of the 9 agents die in the first hour, and they are unused in the 2nd hour, seems inefficient.

Addressing the first thought, yes lets see how many partitions we can narrow down to one, assuming we limit the max concurrency so that we have 8 agents surviving in the worst case:

 total # of agents min deaths (min concurrency) max deaths(max concurrency) # of sets/partitions that can be narrowed down to 1.(w/concurrency) (up to) partitions covered per agent death (worst case) w/concurrency # of sets/partitions that can be narrowed down to 1. w/o concurrency (max death = 1) partitions covered per agent death (worst case) w/o concurrency 1 n/a n/a n/a n/a 2 n/a n/a n/a n/a 3 n/a n/a n/a n/a 4 n/a n/a n/a n/a 5 n/a n/a n/a n/a 6 n/a n/a n/a n/a 7 n/a n/a n/a n/a 8 0 0 1 n/a 1 n/a 9 0 1 10 10 10 10 10 0 2 56 28 11 11 11 0 3 232 77.33 12 12 12 0 4 794 198.5 13 13 13 0 5 2380 476 14 14 14 0 6 6476 1079.33 15 15 a 0 a-8 sum(ncr(a,k),k=0..(a-8)) (2^a)/a a+1 a+1

note that ncr(n,r) is a function which gives the number of ways to select r items from n items where order of selection is irrelevant, and duplicate selection is not allowed.  You may have seen it in combinatorics as a combination, “n choose r” (nCr).  Google it if its not familiar.

So we can use 12 agents to solve 794 bits/sets with at least 8 agents surviving.  So in our problem of 100,000 bits we can use 12 agents to split it into 391 partitions of 256 bits in the first hour, and then use 8 that come back alive in the 2nd hour to isolate the individual bit from the 256.  With this method a total of 12 agents are required.

What about my 2nd thought, about how in the best case (or any non-worst case) we don’t utilize the agents that survive in the first hour.

If the faulty bit is in the partition that is tested by no agents, or only a few agents in the first hour, we will be left with more agents in the 2nd hour than necessary to finish.  So we actually have enough agents in these non-worst-cases to scan a larger partition in the second hour, so maybe we should make these partitions larger?  That might result in less total agents required to cover 100,000 bits.

How much bigger?

Well, we know in the last hour we will need enough agents alive to solve the size of the partition with the faulty bit.  If the size of the partition is S, and we know ‘a’ agents can isolate a faulty bit out of 2^a, then 2^a>=S, and solving for a, a>=lg(S).

so for a given partition, if the number of agents that don’t check it is An, that will be the number that will survive, so if An agents don’t check a partition, it can be as large as 2^An bits in size.

so the partition that is untested in hour1 can be 2^a large, while the partitions that are test by 1 agent, can be 2^(a-1) bits in size, and so on…  lets create a chart with 3 agents:

 partition 1 partition 2 partition 3 partition 4 partition 5 partition 6 partition 7 partition 8 agent1 check check check check agent2 check check check check agent3 check check check check

now lets look at what the size of each partition can be, in order for the bit to be isolated in the next hour by the surviving agents.

 # of agents that checked Number of agents that will survive if the faulty bit exists in this partitoin. (# if agents that didn’t check it) size that the partition should be (up to) in bits.  This is the number of bits which the surviving agents could solve in 1 hour. partition 1 0 3 8 partition 2 1 2 4 partition 3 1 2 4 partition 4 1 2 4 partition 5 2 1 2 partition 6 2 1 2 partition 7 2 1 2 partition 8 3 0 1

So the total number of bits that 3 agents could process to find the 1 faulty bit in 2 hours is 27.

‘a’ agents can split up into 2^a partitions, and each partitions size varies depending on the number of agents testing it.

So for a given number of agents ‘a’, there will be

 # of agents which check a partition # of agents who consequently didn’t check. (# that will survive if faulty bit is there) max size of each partition (based on the number that will survive) number of partitions with this many agents checking. Total bits covered by all partitions scanned by the number of agents characterized in each row. 0 a-0 2^(a-0) ncr(a,0) 2^(a-0)*ncr(a,0) 1 a-1 2^(a-1) ncr(a,1) 2^(a-1)*ncr(a,1) 2 a-2 2^(a-2) ncr(a,2) 2^(a-2)*ncr(a,2) 3 a-3 2^(a-3) ncr(a,3) 2^(a-3)*ncr(a,3) k a-k 2^(a-k) ncr(a,a-k) 2^(a-k)*ncr(a,a-k) a-2 a-(a-2) 2^(a-(a-2)) ncr(a,a-(a-2)) 2^(a-(a-2))*ncr(a,a-(a-2)) a-1 a-(a-1) 2^(a-(a-1)) ncr(a,a-(a-1)) 2^(a-(a-1))*ncr(a,a-(a-1)) a a-a 2^(a-a) ncr(a,a-a) 2^(a-a)*ncr(a,a-a)

so in total we will have to add the column of the chart that depicts the total bits covered by all partitions characterized by each row.

That is Sum( 2^(a-k) * ncr(a,a-k) , k=0..a ) , I can also re-write this as Sum( 2^k * ncr(a,k), k=0..a) which actually simplifies to 3^a.  So in two hours, the number of bits that ‘a’ agents can solve is 3^a.

 number of agents number of bits solvable in 2 hours 1 3 2 9 3 27 4 81 5 243 6 729 7 2187 8 6561 9 19683 10 59049 11 177147 12 531441 a 3^a

So in 2 hours we can solve up to 177,147 bits with 11 agents, and certainly 100,000 which is question A of this puzzle.

It also asks for a general formula for n bits and t hours….  well we now know ‘a’ agents can cover 3^a bits with 2 hours, 2^a bits with 2 hours (as mentioned earlier), and presumably (t+1)^a bits with t hours.  solving for a,

(t+1)^a = nbits, a = log(nbits)/log(t+1), and if its not an integer, we need to round up, so technically

agents = ceiling( log( number of bits) / log( number of hours + 1) )

that is the bonus part of question A.

Question B asks, what if we need to accomplish this task of finding 1 bad bit out of 100,000 twice… well we know for the final attempt we will need 11 agents alive, so lets figure out how we can do it once, with a worst case of 11 agents surviving.  We just need to figure out how many are needed when we restrict the concurrency to 11 less then the total for the entire 2-hour process.

so instead of Sum( 2^(a-k) * ncr(a,a-k) , k=0..a ) for the number of bits a agents can figure out in two hours, during the first task ‘a’ agents will only be able to process this sum( ncr(a, k) * sum(ncr(k, h), h = 11 .. k) , k = 11 .. a ) many bits with ‘a’ agents, because we use less concurrency, since 11 need to survive.

So to answer question B, we simply find the minimum a, such that:

sum(ncr(a, k)*(sum(ncr(k, h), h = 11 .. k)), k = 11 .. a) > 10^5

and that solution is a=16.

So we need 16 agents to be able to accomplish the task twice.

Admittedly this walkthrough of question B is brief and maybe rushed,  but if you followed A, you should be able to connect the dots and fill in between the lines.

Fun-stuff.

IBM published their solution:

```Using n agents in t hours, we can discover a faulty bit from a set of (t+1)^n bits.
We prove that using induction on t. The base (t=0) is trivial. For larger t's, we write the bit addresses in base t+1 and give the i-th agent to test the numbers whose i-th digit is t. After an hour, we are left with k living agents, n-k known digits, and k digits with only t possible values, which suits the induction.
If we have two hours to find the faulty bit (t=2), we need 11 (floor(log_3 100,000)) agents to solve part A. One possible solution is 3^11, which is 77,147 more than 100,000. 77,147 is the dollar sum racked up by IBM's Watson while winning Jeopardy! last month.
Having more agents can help us lose less of them. For example, if we have 100,000 agents, we can find the faulty bit in one hour while losing only one agent.
Careful calculation reveals that using 16 agents, we can choose a set of 100,000 16-digit numbers in base 3 that do not contain the digit 2 more than five times. This guarantees that at least 11 agents would live for the second part.```