# #word problem, but first: why I’m an idiot:

So I’ve been working on this little puzzle:

“If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?” from ita software

I outlined the framework of an algorithmic solution months ago when I first saw it, and i solved it with maple, but never got to implement it using a ‘proper’ programming language.

This problem came to mind again recently as a fun excuse to get my hands dirty with a computer programming language called python.

I’ll start at the end, and say I was totally frustrated, because my seemingly perfect algorithm gave the wrong answer, in spite of passing all sorts of verification tests… I finally got to the point I was so convinced that unless I failed to spell the 27 numbers in my native language’s naming scheme, the problem’s hint must have been wrong…. >>>

so 40 is ‘forty’, not ‘fourty’. and then everything worked perfectly…

Question: Why not just create all the numbers as words, alphabetize them, and pick the 51st billionth letter?

Answer: Creating, then alphabetizing 1 billion strings, some as long as 93 letters, would take awhile.

we can sort a list of strings in O(n* log(n) ) where n is the number of strings/integers, but in our case the size of the strings is also related to n.

note the max number of letters / words / number of digits of the first n integers is ~ O(log(n))

example: all numbers less than a billion (10^9) ~ log(10^9)= 9 digits.

**The traditional brute force solution might be possible with time complexity O(n*(log(n))^2) and memory requirement ~ O(n)**

**my solution has a time complexity of O( (log(n))^3) and memory requirement ~ O(log(n)) <-to store the list of names of numbers (‘thousand’,’million’,’billion’, ‘trillion’…)**

Let’s Start:

The number which has this 51st billionth character only has at most 14 words, so if we can narrow it down word by word, to where the 51st billionth word is, it would be really fast.

all the numbers that start with ‘five’ will be grouped together when listed alphabetically, so as I’m counting this list, it would be nice if I didn’t have to count all of them individually, especially when i dont care about the order of the numbers if im just summing them together

let’s see there’s:

five

five hundred x x

five thousand x x x

five hundred x x thousand x x x

five million x x x x x x

five hundred x x million x x x x x x

alphabetically they will be:

five

five hundred xx or five hundred xx xxx or five hundred xx xxx xxx

five million x x x x x x

five thousand x x x

lets consider for a moment the “five million xxx xxx”,

representing numbers 5,000,000 to 5,999,999 , thats 1,000,000 numbers.

so the words “five million” will occur a million times in these numbers that start with “five million”

there are 4 letters in ‘five’ and 7 letters in ‘million’ so that’ is 11 letters.

so there are 11 million letters involved with the first 2 words of all numbers starting with “five million”

what bout the xxx xxx? well it would be nice if we knew the total letters of all numbers w/ 6 or fewer digits?, wouldn’t it???… lets start small.

note: the word ‘zero’ will not exist in this list, because, for example, for 20 we say twenty, not twenty zero.

the sum of digits of all numbers of 1 digit, well its just the number of letters in this list:

n1=[“one”,”two”,”three”,”four”,”five”,”six”,”seven”,”eight”,”nine”]

so sumn[1]=number_of_letters(n1)=36

so the number of letters in single digit positive integers is 36

now for 2 digit numbers:

well for each word in

n2=[“twenty”,”thirty”,”forty”,”fifty”,”sixty”,”seventy”,”eighty”,”ninety”]

say for example, “thirty”

we have

30 thirty (note we wouldn’t write ‘thirty zero’)

31 thirty one

32 thirty two

33 thirty three

34 thirty four

35 thirty five

36 thirty six

37 thirty seven

38 thirty eight

39 thirty nine

so as we see here, each word (and hence each letter) in n2 will occur 10 times, and for each of the 8 words, each n1 will occur once.

so lets count this, 10*number_of_letters_in(n2)+8*number_of_letters_in(n1)

number_of_letters_in(n1) is sumn[1], so, so far we can count 10*number_of_letters_in(n2)+8*sumn[1]

but lets not forget about the other two digit numbers:

n2t=[“ten”,”eleven”,”twelve”,”thirteen”,”fourteen”,”fifteen”,”sixteen”,”seventeen”,”eighteen”,”nineteen”]

each one of these will occur once, so in total

sumn[2]=10*number_of_letters_in(n2)+8*sumn[1]+number_of_letters_in(n2t)

since n2 contains 46 letters, n2t contains 70, and sum[1] we already computed to be 36:

the number of letters in all 2-digit numbers is: sumn[2]= 818

the number of letters in all numbers with 2 or fewer digits is sumx[2]=sumn[2]+sumn[1] = 854

I’ll illustrate one more:

for 3 digits again we have choices in n1 for the leading word.

n1=[“one”,”two”,”three”,”four”,”five”,”six”,”seven”,”eight”,”nine”]

(because you can’t lead a 3 digit number with “fifteen” or “fifty”, but you can with “five”

so lets pick ‘five’ to examine:

500

501

502

503

… so there are far too many to think about, but clearly we will have 500 to 599, which is 100 numbers,

so ‘five’ ‘hundred’ will occur 100 times, and then we have to count all the letters in the numbers 01-99,

but actually we already did, sumx[2] is the number of letters in positive integers of 2 digits or less..

there are 9 words in n1, so ‘hundred’ will occur 9*100*number_of_letters_in(‘hundred’), also sumx[2] will need to be counted 9 times, one for each leading number in n1.

sumn[3]= 9*100*number_of_letters_in(‘hundred’)+100*number_of_letters_in(n1)+9*sumx[2]

since hundred has 7 letters, n1 contains 36 letters, and sumx[2]=854:

sumn[3]= 17586

meaning there are 17,586 letters involved in all 3 digit numbers.

sumx[3]=sumn[3]+sumx[2] = 18440

meaning the number of letters in numbers with 3 or less digits is 18,440

you can see (hopefully) that we can find sumn[n] with a little bit of logic, with the location of n, considering if the numbers are going to need n1 or n2/n2t for a leading digit and the value of sum[n-1]

so if i were to write sumn(n) as a function it would probably use recursion, but instead ill just built up from sum[1] and store the values of sum[1] through sumn[9].

actually i really only care about sumx[n], where we take the number of letters in numbers n digits or less:

here are the values of sumx[0] to sumx[9]

[0, 36, 854, 18440, 292400, 3490000, 44872000, 547720000, 6034200000, 70305000000]

this is done with the python code below:

note: ‘len()’ is a magnitude function, measuring number of elements in a list, or number of letters in a word.

note2: x+=y adds y to x just like x=x+y does.

n1=["one","two","three","four","five","six","seven","eight","nine"] n2t=["ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen","seventeen","eighteen","nineteen"] n2=["twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"]

def sumx9(): """sumn is the array containing the number of chars for all postive integers of n digits""" sumn=[0] """sumx is th array containing the number of chars for all positive integers of <=n digits""" sumx=[0] for p in range(1,10): tempsum=0 if p in (1,3,4,6,7,9): m=n1 else: m=n2 "the following is about the teen numbers" tempsum+=len(n2t)*sumx[p-2] for n in range(0,len(n2t)): tempsum+=len(n2t[n])*10**(p-2) """takes the smaller numbers' characters and adds it for each number in m""" tempsum+=len(m)*sumx[p-1] for n in range(0,len(m)): tempsum+=len(m[n])*10**(p-1) """ the code below: this will count all the hundred/thousand/million words not included in the smaller numbers like 400,021. This number has a thousand, but 21 would not include the thousand, so we must include it. These are only needed in rare cases since most numbers like 401,021 or 482,021, the number beyond the leading digit carry the hundred/thousand/million words in themselves. 1,021 and 82,021 respectively both include the word thousand, and we don't wanna double count. """ if p in (3,6,9): tempsum+=len(m)*len('hundred')*10**(p-1) if p in (4,5,6): tempsum+=len(m)*len('thousand')*10**3 if p ==5: tempsum+=len(n2t)*len('thousand')*10**3 if p in (7,8,9): tempsum+=len(m)*len('million')*10**6 if p ==8: tempsum+=len(n2t)*len('million')*10**6 sumn.append(tempsum) sumx.append(sumx[p-1]+tempsum) """ ("the sum of the characters of all ",p,"digit numbers is",sumn[p]) print("the sum of the characters of all <=",p,"digit numbers is",sumnsum[p]) """ return sumx

so getting back to what I was saying before, how many letters are involved in all numbers starting with ‘five’ ‘million’

well we know there are 11 million letters involved with the words ‘five’ and ‘million’ since the pair will be written a million times, and what about the ~million different 6 (or less) digit numbers, since we are adding them up, we just need number of letters in all those numbers, and we do, sumx[6] is the number of letters in positive integers of 6 or fewer digits.

so the number of letters involved with numbers starting with ‘five’ ‘million’ is:

10^6*number_of_letters_in(‘five’,’million’)+sumx(6) = 11*10^11+44872000 = 55872000

see how easy that becomes?

so now we can do all of them

five = 4

five hundred x x + five hundred x x million x x x x x x + five hundred x x thousand x x x= 7145799954

five million x x x x x x = 55872000

five thousand x x x = 30440

so the total number of characters of all numbers starting with 5 is 7,201,702,398

I have sort of glossed over a few details here, but hopefully you can fill in the blanks, or read the code:

this function counts all the letters of all words beginning with some input list of strings:

"""countpossibles returns the total characters for all <=9 digit numbers . starting with the words defined in order with the input string list """ def countpossibles(words2): sumx=sumx9() tempsum=0 word='' words=[] if words2==[]: tempsum=sumx[9] else: nextwords__=nextwords(words2) for nn in range(0,len(words2)-1): word+=words2[nn] words+=[words2[nn]] maxx=maxplaceofnextdigit(words) testword=words2[len(words2)-1] if testword is "": tempsum+=len(word) elif testword is "hundred": tempsum+=(len(word)+len('hundred'))*10**2+sumx[2] if maxx>=5: tempsum+=(len(word)+len('hundred'))*10**5+sumx[5]+len('thousand')*10**3 if maxx>=8: tempsum+=(len(word)+len('hundred'))*10**8+sumx[8]+len('million')*10**6 elif testword is "million": tempsum+=(len(word)+len(testword))*10**6+sumx[6] elif testword is "thousand": tempsum+=(len(word)+len(testword))*10**3+sumx[3] elif testword in n1: if maxx>=9: tempsum+=(len(word)+len(testword)+len('hundred'))*10**8+sumx[8]+len('million')*10**6 if maxx>=7: tempsum+=(len(word)+len(testword)+len('million'))*10**6+sumx[6] if maxx>=6: if 'hundred' in nextwords__: tempsum+=(len(word)+len(testword)+len('hundred'))*10**5+sumx[5]+len('thousand')*10**3 if maxx>=4: tempsum+=(len(word)+len(testword)+len('thousand'))*10**3+sumx[3] if maxx>=3: if 'hundred' in nextwords__: tempsum+=(len(word)+len(testword)+len('hundred'))*10**2+sumx[2] if maxx>=1: tempsum+=(len(word)+len(testword)) elif testword in n2: if maxx>=8: tempsum+=(len(word)+len(testword))*10**7+sumx[7]+len('million')*10**6 if maxx>=5: tempsum+=(len(word)+len(testword))*10**4+sumx[4]+len('thousand')*10**3 if maxx>=2: tempsum+=(len(word)+len(testword))*10**1+sumx[1] elif testword in n2t: if maxx>=8: tempsum+=(len(word)+len(testword)+len('million'))*10**6+sumx[6] if maxx>=5: tempsum+=(len(word)+len(testword)+len('thousand'))*10**3+sumx[3] if maxx>=2: tempsum+=(len(word)+len(testword)) return tempsum

to do this, i wrote two functions I haven’t explained yet:

for now i Hope the code/comments are self explanatory:

""" maxplaceofnextdigit has an input of a list of words like ['nine','million','eight','hundred','thousand', 'three'] and the output will be the maximum place of the next digit (number of places to the left of the decimal place) so on the sample input word list above, the output would be 2, since the maximum place of the next digit would corresond to 10^2 it only works for positive integers <=9 digits its recursive """ def maxplaceofnextdigit(words): if words==[]: maxxx = 9 """maximum digits of possible number""" else: previous=maxplaceofnextdigit(words[0:len(words)-1]) if words[len(words)-1] is "billion": maxxx=9 if words[len(words)-1] is "million": maxxx=6 if words[len(words)-1] is "thousand": maxxx=3 if words[len(words)-1] is "hundred": for n in [2,5,8,11,14]: if n<=previous: maxxx=n if words[len(words)-1] in n2: for n in [1,4,7,10,13]: if n<previous: maxxx=n if words[len(words)-1] in n2t: for n in [0,3,6,9,12]: if n<previous: maxxx=n if words[len(words)-1] in n1: for n in [0,2,3,5,6,8,9,11,12]: if n<previous: maxxx=n if words[len(words)-1] is '': maxxx=0 return maxxx

and:

""" if 'words' are a list of words of a number (in order left to right) as it would be spoken, beggning with the first word, and ending anywhere (full words only) nextwords() will return the possible next words word of the number only works for positive integers less than 10 digits. uknown results with improper numbers/words (like "three,hundred,thousand,seventeen,hundred") """ def nextwords(words): nextwords_=[""] words.reverse() if words!=[]: if "thousand" not in words and "million" not in words: nextwords_.append("million") nextwords_.append("thousand") elif "thousand" not in words: if words[0] is not "million": nextwords_.append("thousand") if words[0] in n1: if "hundred" in words: if "thousand" in words: if words.index("hundred")>=words.index("thousand") and words[1] not in n2: nextwords_.append("hundred") elif "million" in words: if words.index("hundred")>=words.index("million")and words[1] not in n2: nextwords_.append("hundred") elif "hundred" not in words: if len(words)>=2: if words[1] not in n2: nextwords_.append("hundred") elif len(words)<2: nextwords_.append("hundred") if words[0] not in n1+n2t: nextwords_=nextwords_+n1 if words[0] in ["hundred","thousand","million",""]: nextwords_=nextwords_+n2+n2t if words[0]=="": nextwords_=[] elif words==[]: nextwords_=['']+n1+n2+n2t words.reverse() nextwords_.sort() return nextwords_

so now we have the infrastructure to solve the puzzle, ready?

Let’s review our tools:

countpossibles() will count all the possible letters involved with all strings beginning with whatever input you give it.

nextwords() will return all possible words that could immediately follow those in the input list.

the idea that we dont have to worry about the order of huge sums of words in this giant string, as long as we count the sum properly, and the 51st billionth character isnt in the group, the order of the numbers is irrelevant.

We can in order look at the possible first words:

nextwords([]) = [‘’]+n1+n2+n2t =

[”, ‘eight’, ‘eighteen’, ‘eighty’, ‘eleven’, ‘fifteen’, ‘fifty’, ‘five’, ‘forty’, ‘four’, ‘fourteen’, ‘nine’, ‘nineteen’, ‘ninety’, ‘one’, ‘seven’, ‘seventeen’, ‘seventy’, ‘six’, ‘sixteen’, ‘sixty’, ‘ten’, ‘thirteen’, ‘thirty’, ‘three’, ‘twelve’, ‘twenty’, ‘two’]

so now lets count all letters in numbers that start with ‘’ (the empty string), well zero letters are in numbers without a first word.

countpossibles([‘’]) = 0

and now the number of letters in all possible numbers that start with ‘eight’ (positive integers <=9 digits)

and countpossibles([‘eight’] = 7,302,803,499

since that doesnt come close to the 51 billion we are wanting to count, we dont care about the order of all those numbers when alphabetized, the sum is sufficient!

and we can contine on adding these up until we surpass 51 billion, then we know the first word of the number containing the 51st billionth char, its the word where countpossibles([‘word’]) bumped the aggregate over 51billion.

before we get too far along let me point out one flaw in this algorithm, not all numbers will be grouped by the first name, for example..

the first few numbers will be :

‘eight’

‘eighteen’

‘eighteen million’ xxx xxx

‘eighteen thousand’ xxx

‘eight hundred’

‘eight million’

‘eight thousand’

‘eighty’

‘eighty eight’

. . .

point is the eighteens are in the middle of the eights, because the word eight exists in eighteen, so depending on the 2nd word, eighteen could come before it, like eight hundred comes after eighteen, but eight comes before eighteen. .. this scenario also occurs with nine/nineteen/ninety, seven/seventeen/seventy, six/sixty/sixteen

so we need to look ahead 1 word when grouping, the word in question and one word ahead will be sufficient to group all numbers with the leading word alphabetically.

so lets look at the first few numbers in the list of words:

note: [‘six’, ‘’] means ‘six’ as the first word, and ‘’ (nothing) as the second, indicating the word is finished… so our countpossibles([‘six’,’’]) won’t count six hundred, six thousands or all the other numbers starting with ‘six’.

0 letters in all words starting with [”] 51000000000 letters are left to traverse

5 letters in all words starting with [‘eight’, ”] 50999999995 letters are left to traverse

8 letters in all words starting with [‘eighteen’, ”] 50999999987 letters are left to traverse

59872000 letters in all words starting with [‘eighteen’, ‘million’] 50940127987 letters are left to traverse

34440 letters in all words starting with [‘eighteen’, ‘thousand’] 50940093547 letters are left to traverse

7245900054 letters in all words starting with [‘eight’, ‘hundred’] 43694193493 letters are left to traverse

. . .

61908450 letters in all words starting with [‘seventy’, ‘six’] 4391707697 letters are left to traverse

33440 letters in all words starting with [‘seventy’, ‘thousand’] 4391674257 letters are left to traverse

63910452 letters in all words starting with [‘seventy’, ‘three’] 4327763805 letters are left to traverse

61908450 letters in all words starting with [‘seventy’, ‘two’] 4265855355 letters are left to traverse

3 letters in all words starting with [‘six’, ”] 4265855352 letters are left to traverse

7045699854 letters in all words starting with [‘six’, ‘hundred’] -2779844502 letters are left to traverse

so far we have: [‘six’]

there are 4265855355 chars left to traverse

we finally get to a point where ‘six’ ‘hundred’ stepped over the 51 billion we are counting, so the number/word in question must begin with six… although really we can say the 2nd word will be ‘hundred’ because hundred isnt a word that exists as the beginning of another word, like six, eight, seven or nine, but for simplicity of the algorithm play dumb and continue with our heads down.

so we add back on all the letters we counted starting with ‘six’, 3+7045699854 and we again have a positive number of letters left to traverse, 4265855355.

lets repeat the same process appending possible following words to ‘six’

3 letters in all words starting with [‘six’, ”] 4265855352 letters are left to traverse

10 letters in all words starting with [‘six’, ‘hundred’, ”] 4265855342 letters are left to traverse

66913455 letters in all words starting with [‘six’, ‘hundred’, ‘eight’] 4198941887 letters are left to traverse

69916458 letters in all words starting with [‘six’, ‘hundred’, ‘eighteen’] 4129025429 letters are left to traverse

. . .

715180596 letters in all words starting with [‘six’, ‘hundred’, ‘ninety’] 752050874 letters are left to traverse

64911453 letters in all words starting with [‘six’, ‘hundred’, ‘one’] 687139421 letters are left to traverse

66913455 letters in all words starting with [‘six’, ‘hundred’, ‘seven’] 620225966 letters are left to traverse

70917459 letters in all words starting with [‘six’, ‘hundred’, ‘seventeen’] 549308507 letters are left to traverse

725190606 letters in all words starting with [‘six’, ‘hundred’, ‘seventy’] -175882099 letters are left to traverse

so far we have: [‘six’, ‘hundred’]

there are 4265855352 chars left to traverse

now as expected we over shoot the number of letters we are counting with hundred as the 2nd word.

lets find the third:

10 letters in all words starting with [‘six’, ‘hundred’, ”] 4265855342 letters are left to traverse

15 letters in all words starting with [‘six’, ‘hundred’, ‘eight’, ”] 4265855327 letters are left to traverse

18 letters in all words starting with [‘six’, ‘hundred’, ‘eighteen’, ”] 4265855309 letters are left to traverse

69872000 letters in all words starting with [‘six’, ‘hundred’, ‘eighteen’, ‘million’] 4195983309 letters are left to traverse

44440 letters in all words starting with [‘six’, ‘hundred’, ‘eighteen’, ‘thousand’] 4195938869 letters are left to traverse

66872000 letters in all words starting with [‘six’, ‘hundred’, ‘eight’, ‘million’] 4129066869 letters are left to traverse

41440 letters in all words starting with [‘six’, ‘hundred’, ‘eight’, ‘thousand’] 4129025429 letters are left to traverse

16 letters in all words starting with [‘six’, ‘hundred’, ‘eighty’, ”] 4129025413 letters are left to traverse

72919461 letters in all words starting with [‘six’, ‘hundred’, ‘eighty’, ‘eight’] 4056105952 letters are left to traverse

. . .

72919461 letters in all words starting with [‘six’, ‘hundred’, ‘seventy’, ‘nine’] 187757645 letters are left to traverse

71918460 letters in all words starting with [‘six’, ‘hundred’, ‘seventy’, ‘one’] 115839185 letters are left to traverse

73920462 letters in all words starting with [‘six’, ‘hundred’, ‘seventy’, ‘seven’] 41918723 letters are left to traverse

71918460 letters in all words starting with [‘six’, ‘hundred’, ‘seventy’, ‘six’] -29999737 letters are left to traverse

so far we have: [‘six’, ‘hundred’, ‘seventy’]

there are 549308507 chars left to traverse

and so on until we pinpoint the exact number the 51st billionth character exists in.

notice that each time we do this, we learn 1 word of the desired number , and there are at most 14 words in any integer less than a billion, so it’s not that bad.

This program takes about 90ms on my computer with python to find the 51st billionth character, but also will find the character at any other index in this sorted and concatenated string with 100% accuracy.

Again, I may have glossed over a couple details, so here’s the code:

""" "If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?" this code will actually find ANY character in this with string 100% accuracy """ word="" words=[] find = 51*10**9 left=find while left >= 1: print("so far we have:",words) print("there are",left,"chars left to traverse") print("") next2words=[] county=[] for next1 in classy.nextwords(words): county.append(0) if next1!="": for next2 in classy.nextwords(words+[next1]): next2words.append([next1+next2,next1,next2]) next2words.append(['','','']) next2words.sort() for n in range(0,len(next2words)-1): if next2words[n][1]=='': add=[''] else: add=[next2words[n][1]]+[next2words[n][2]] tempsum=classy.countpossibles(words+add) if left>tempsum: left-=tempsum county[classy.nextwords(words).index(next2words[n][1])]+=tempsum elif tempsum>=left: left+=county[classy.nextwords(words).index(next2words[n][1])] words.append(next2words[n][1]) word=word+next2words[n][1] if len(word)>=left: print(words) print("done!, the ",find,"st/th character is ", word[left-1]) print(word[0:left-1]+'*'+word[left:len(word)]) left-=tempsum break

The code is for readability more than speed, could be heavily optimized, but still O( (log(n))^3 ) where n is the largest integer~=number of integers in the problem.

## Leave a Reply