Skip to content
January 10, 2009 / steve8

#word problem, but first: why I’m an idiot:

So I’ve been working on this little puzzle:

“If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?” from ita software

I outlined the framework of an algorithmic solution months ago when I first saw it, and i solved it with maple, but never got to implement it using a ‘proper’ programming language.

This problem came to mind again recently as a fun excuse to get my hands dirty with a computer programming language called python.

I’ll start at the end, and say I was totally frustrated, because my seemingly perfect algorithm gave the wrong answer, in spite of passing all sorts of verification tests… I finally got to the point I was so convinced that unless I failed to spell the 27 numbers in my native language’s naming scheme, the problem’s hint must have been wrong…. >>>

google

so 40 is ‘forty’, not ‘fourty’.  and then everything worked perfectly…

Question: Why not just create all the numbers as words, alphabetize them, and pick the 51st billionth letter?

Answer: Creating, then alphabetizing 1 billion strings, some as long as 93 letters, would take awhile.

we can sort a list of strings in O(n* log(n) ) where n is the number of strings/integers, but in our case the size of the strings is also related to n.

note the max number of letters / words / number of digits of the first n integers is ~ O(log(n))

example: all numbers less than a billion (10^9) ~ log(10^9)= 9 digits.

The traditional brute force solution might be possible with time complexity O(n*(log(n))^2) and memory requirement ~ O(n)

my solution has a time complexity of O( (log(n))^3)  and memory requirement ~ O(log(n)) <-to store the list of names of numbers (‘thousand’,’million’,’billion’, ‘trillion’…)

Let’s Start:

The number which has this 51st billionth character only has at most 14 words, so if we can narrow it down word by word, to where the 51st billionth word is, it would be really fast.

all the numbers that start with ‘five’ will be grouped together when listed alphabetically, so as I’m counting this list, it would be nice if I didn’t have to count all of them individually, especially when i dont care about the order of the numbers if im just summing them together

let’s see there’s:

five

five hundred x x

five thousand x x x

five hundred x x thousand x x x

five million x x x x x x

five hundred x x million x x x x x x

alphabetically they will be:

five

five hundred xx or five hundred xx xxx or five hundred xx xxx xxx

five million x x x x x x

five thousand x x x

lets consider for a moment the “five million xxx xxx”,

representing numbers 5,000,000 to 5,999,999 , thats 1,000,000 numbers.

so the words “five million” will occur a million times in these numbers that start with “five million”

there are 4 letters in ‘five’ and 7 letters in ‘million’ so that’ is 11 letters.

so there are 11 million letters involved with the first 2 words of all numbers starting with “five million”

what bout the xxx xxx? well it would be nice if we knew the total letters of all numbers w/ 6 or fewer digits?, wouldn’t it???… lets start small.

note: the word ‘zero’ will not exist in this list, because, for example, for 20 we say twenty, not twenty zero.

the sum of digits of all numbers of 1 digit, well its just the number of letters in this list:

n1=[“one”,”two”,”three”,”four”,”five”,”six”,”seven”,”eight”,”nine”]

so sumn[1]=number_of_letters(n1)=36

so the number of letters in single digit positive integers is 36

now for 2 digit numbers:

well for each word in

n2=[“twenty”,”thirty”,”forty”,”fifty”,”sixty”,”seventy”,”eighty”,”ninety”]

say for example, “thirty”

we have

30 thirty     (note we wouldn’t write ‘thirty zero’)

31 thirty one

32 thirty two

33 thirty three

34 thirty four

35 thirty five

36 thirty six

37 thirty seven

38 thirty eight

39 thirty nine

so as we see here, each word (and hence each letter) in n2 will occur 10 times, and for each of the 8 words, each n1 will occur once.

so lets count this, 10*number_of_letters_in(n2)+8*number_of_letters_in(n1)

number_of_letters_in(n1) is sumn[1], so, so far we can count 10*number_of_letters_in(n2)+8*sumn[1]

but lets not forget about the other two digit numbers:

n2t=[“ten”,”eleven”,”twelve”,”thirteen”,”fourteen”,”fifteen”,”sixteen”,”seventeen”,”eighteen”,”nineteen”]

each one of these will occur once, so in total

sumn[2]=10*number_of_letters_in(n2)+8*sumn[1]+number_of_letters_in(n2t)

since n2 contains 46 letters, n2t contains 70, and sum[1] we already computed to be 36:

the number of letters in all 2-digit numbers is: sumn[2]= 818

the number of letters in all numbers with 2 or fewer digits is sumx[2]=sumn[2]+sumn[1] = 854

I’ll illustrate one more:

for 3 digits again we have choices in n1 for the leading word.

n1=[“one”,”two”,”three”,”four”,”five”,”six”,”seven”,”eight”,”nine”]

(because you can’t lead a 3 digit number with “fifteen” or “fifty”, but you can with “five”

so lets pick ‘five’ to examine:

500

501

502

503

… so there are far too many to think about, but clearly we will have 500 to 599, which is 100 numbers,

so ‘five’ ‘hundred’ will occur 100 times, and then we have to count all the letters in the numbers 01-99,

but actually we already did, sumx[2] is the number of letters in positive integers of 2 digits or less..

there are 9 words in n1, so ‘hundred’ will occur 9*100*number_of_letters_in(‘hundred’), also sumx[2] will need to be counted 9 times, one for each leading number in n1.

sumn[3]= 9*100*number_of_letters_in(‘hundred’)+100*number_of_letters_in(n1)+9*sumx[2]

since hundred has 7 letters, n1 contains 36 letters, and sumx[2]=854:

sumn[3]= 17586

meaning there are 17,586 letters involved in all 3 digit numbers.

sumx[3]=sumn[3]+sumx[2] = 18440

meaning the number of letters in numbers with 3 or less digits is 18,440

you can see (hopefully) that we can find sumn[n] with a little bit of logic, with the location of n, considering if the numbers are going to need n1 or n2/n2t for a leading digit and the value of sum[n-1]

so if i were to write sumn(n) as a function it would probably use recursion, but instead ill just built up from sum[1] and store the values of sum[1] through sumn[9].

actually i really only care about sumx[n], where we take the number of letters in numbers n digits or less:

here are the values of sumx[0] to sumx[9]

[0, 36, 854, 18440, 292400, 3490000, 44872000, 547720000, 6034200000, 70305000000]

this is done with the python code below:

note: ‘len()’ is a magnitude function, measuring number of elements in a list, or number of letters in a word.

note2: x+=y adds y to x just like x=x+y does.

n1=["one","two","three","four","five","six","seven","eight","nine"]
n2t=["ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen","seventeen","eighteen","nineteen"]
n2=["twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"]
def sumx9():
    """sumn is the array containing the number of chars for all postive integers of n digits"""
    sumn=[0]
    """sumx is th array containing the number of chars for all positive integers of <=n digits"""
    sumx=[0]
    for p in range(1,10):
        tempsum=0
        if p in (1,3,4,6,7,9):
            m=n1
        else:
            m=n2
            "the following is about the teen numbers"
            tempsum+=len(n2t)*sumx[p-2]
            for n in range(0,len(n2t)):
                tempsum+=len(n2t[n])*10**(p-2)

        """takes the smaller numbers' characters and adds it for each number in m"""
        tempsum+=len(m)*sumx[p-1]

        for n in range(0,len(m)):
            tempsum+=len(m[n])*10**(p-1)

        """
        the code below:
        this will count all the hundred/thousand/million words not included
        in the smaller numbers like 400,021.
        This number has a thousand, but 21 would not include the thousand,
        so we must include it.  These are only needed in rare cases
        since most numbers like 401,021 or 482,021, the number beyond the
        leading digit carry the hundred/thousand/million words in themselves.
        1,021 and 82,021 respectively both include the word thousand,
        and we don't wanna double count.
        """

        if p in (3,6,9):
            tempsum+=len(m)*len('hundred')*10**(p-1)

        if p in (4,5,6):
            tempsum+=len(m)*len('thousand')*10**3
            if p ==5:
                tempsum+=len(n2t)*len('thousand')*10**3
        if p in (7,8,9):
            tempsum+=len(m)*len('million')*10**6
            if p ==8:
                tempsum+=len(n2t)*len('million')*10**6
        sumn.append(tempsum)
        sumx.append(sumx[p-1]+tempsum)
        """
        ("the sum of the characters of all   ",p,"digit numbers is",sumn[p])
        print("the sum of the characters of all <=",p,"digit numbers is",sumnsum[p])
        """
    return sumx

so getting back to what I was saying before, how many letters are involved in all numbers starting with ‘five’ ‘million’

well we know there are 11 million letters involved with the words ‘five’ and ‘million’ since the pair will be written a million times, and what about the ~million different 6 (or less) digit numbers, since we are adding them up, we just need number of letters in all those numbers, and we do, sumx[6] is the number of letters in positive integers of 6 or fewer digits.

so the number of letters involved with numbers starting with ‘five’ ‘million’ is:

10^6*number_of_letters_in(‘five’,’million’)+sumx(6) = 11*10^11+44872000 = 55872000

see how easy that becomes?

so now we can do all of them

five = 4

five hundred x x + five hundred x x million x x x x x x + five hundred x x thousand x x x= 7145799954

five million x x x x x x = 55872000

five thousand x x x = 30440

so the total number of characters of all numbers starting with 5 is 7,201,702,398

I have sort of glossed over a few details here, but hopefully you can fill in the blanks, or read the code:

this function counts all the letters of all words beginning with some input list of strings:

"""countpossibles returns the total characters for all <=9 digit numbers .
starting with the words defined in order with the input string list
"""
def countpossibles(words2):
    sumx=sumx9()
    tempsum=0
    word=''
    words=[]
    if words2==[]:
        tempsum=sumx[9]
    else:
        nextwords__=nextwords(words2)
        for nn in range(0,len(words2)-1):
            word+=words2[nn]
            words+=[words2[nn]]

        maxx=maxplaceofnextdigit(words)
        testword=words2[len(words2)-1]
        if testword is "":
            tempsum+=len(word)
        elif testword is "hundred":
            tempsum+=(len(word)+len('hundred'))*10**2+sumx[2]
            if maxx>=5:
                tempsum+=(len(word)+len('hundred'))*10**5+sumx[5]+len('thousand')*10**3
            if maxx>=8:
                tempsum+=(len(word)+len('hundred'))*10**8+sumx[8]+len('million')*10**6

        elif testword is "million":
            tempsum+=(len(word)+len(testword))*10**6+sumx[6]

        elif testword is "thousand":
            tempsum+=(len(word)+len(testword))*10**3+sumx[3]

        elif testword in n1:
            if maxx>=9:
                tempsum+=(len(word)+len(testword)+len('hundred'))*10**8+sumx[8]+len('million')*10**6
            if maxx>=7:
                tempsum+=(len(word)+len(testword)+len('million'))*10**6+sumx[6]
            if maxx>=6:
                if 'hundred' in nextwords__:
                    tempsum+=(len(word)+len(testword)+len('hundred'))*10**5+sumx[5]+len('thousand')*10**3
            if maxx>=4:
                tempsum+=(len(word)+len(testword)+len('thousand'))*10**3+sumx[3]
            if maxx>=3:
                if 'hundred' in nextwords__:
                    tempsum+=(len(word)+len(testword)+len('hundred'))*10**2+sumx[2]
            if maxx>=1:
                tempsum+=(len(word)+len(testword))

        elif testword in n2:
            if maxx>=8:
                tempsum+=(len(word)+len(testword))*10**7+sumx[7]+len('million')*10**6
            if maxx>=5:
                tempsum+=(len(word)+len(testword))*10**4+sumx[4]+len('thousand')*10**3
            if maxx>=2:
                tempsum+=(len(word)+len(testword))*10**1+sumx[1]

        elif testword in n2t:
            if maxx>=8:
                tempsum+=(len(word)+len(testword)+len('million'))*10**6+sumx[6]
            if maxx>=5:
                tempsum+=(len(word)+len(testword)+len('thousand'))*10**3+sumx[3]
            if maxx>=2:
                tempsum+=(len(word)+len(testword))
    return tempsum

to do this, i wrote two functions I haven’t explained yet:

for now i Hope the code/comments are self explanatory:

"""
maxplaceofnextdigit has an input of a list of words
like ['nine','million','eight','hundred','thousand', 'three']

and the output will be the maximum place of the next digit
(number of places to the left of the decimal place)
so on the sample input word list above, the output would be 2,
since the maximum place of the next digit would corresond to 10^2

it only works for positive integers <=9 digits
its recursive
"""

def maxplaceofnextdigit(words):
    if words==[]:
        maxxx = 9
        """maximum digits of possible number"""
    else:
        previous=maxplaceofnextdigit(words[0:len(words)-1])
        if words[len(words)-1] is "billion":
            maxxx=9
        if words[len(words)-1] is "million":
            maxxx=6
        if words[len(words)-1] is "thousand":
            maxxx=3
        if words[len(words)-1] is "hundred":
            for n in [2,5,8,11,14]:
                if n<=previous:
                    maxxx=n
        if words[len(words)-1] in n2:
            for n in [1,4,7,10,13]:
                if n<previous:
                    maxxx=n
        if words[len(words)-1] in n2t:
            for n in [0,3,6,9,12]:
                if n<previous:
                    maxxx=n
        if words[len(words)-1] in n1:
            for n in [0,2,3,5,6,8,9,11,12]:
                if n<previous:
                    maxxx=n
        if words[len(words)-1] is '':
            maxxx=0
    return maxxx

and:

"""
if 'words' are a list of words of a number (in order left to right) as it would be spoken,
beggning with the first word, and ending anywhere (full words only)
nextwords() will return the possible next words word of the number

only works for positive integers less than 10 digits.

uknown results with improper numbers/words (like "three,hundred,thousand,seventeen,hundred")
"""


def nextwords(words):
    nextwords_=[""]
    words.reverse()

    if words!=[]:
        if "thousand" not in words and "million" not in words:

            nextwords_.append("million")
            nextwords_.append("thousand")

        elif "thousand" not in words:
            if words[0] is not "million":
                nextwords_.append("thousand")


        if words[0] in n1:
            if "hundred" in words:
                if "thousand" in words:
                    if words.index("hundred")>=words.index("thousand") and words[1] not in n2:
                        nextwords_.append("hundred")
                elif "million" in words:
                    if words.index("hundred")>=words.index("million")and words[1] not in n2:
                        nextwords_.append("hundred")
            elif "hundred" not in words:
                if len(words)>=2:
                    if words[1] not in n2:
                        nextwords_.append("hundred")
                elif len(words)<2:
                    nextwords_.append("hundred")

        if words[0] not in n1+n2t:
            nextwords_=nextwords_+n1

        if words[0] in ["hundred","thousand","million",""]:
            nextwords_=nextwords_+n2+n2t
        if words[0]=="":
            nextwords_=[]
    elif words==[]:
        nextwords_=['']+n1+n2+n2t


    words.reverse()
    nextwords_.sort()
    return nextwords_

so now we have the infrastructure to solve the puzzle, ready?

Let’s review our tools:

countpossibles() will count all the possible letters involved with all strings beginning with whatever input you give it.

nextwords() will return all possible words that could immediately follow those in the input list.

the idea that we dont have to worry about the order of huge sums of words in this giant string, as long as we count the sum properly, and the 51st billionth character isnt in the group, the order of the numbers is irrelevant.

We can in order look at the possible first words:

nextwords([]) = [‘’]+n1+n2+n2t =

[”, ‘eight’, ‘eighteen’, ‘eighty’, ‘eleven’, ‘fifteen’, ‘fifty’, ‘five’, ‘forty’, ‘four’, ‘fourteen’, ‘nine’, ‘nineteen’, ‘ninety’, ‘one’, ‘seven’, ‘seventeen’, ‘seventy’, ‘six’, ‘sixteen’, ‘sixty’, ‘ten’, ‘thirteen’, ‘thirty’, ‘three’, ‘twelve’, ‘twenty’, ‘two’]

so now lets count all letters in numbers that start with ‘’ (the empty string), well zero letters are in numbers without a first word.

countpossibles([‘’]) = 0

and now the number of letters in all possible numbers that start with ‘eight’ (positive integers <=9 digits)

and countpossibles([‘eight’] = 7,302,803,499

since that doesnt come close to the 51 billion we are wanting to count, we dont care about the order of all those numbers when alphabetized, the sum is sufficient!

and we can contine on adding these up until we surpass 51 billion, then we know the first word of the number containing the 51st billionth char, its the word where countpossibles([‘word’]) bumped the aggregate over 51billion.

before we get too far along let me point out one flaw in this algorithm, not all numbers will be grouped by the first name, for example..

the first few numbers will be :

‘eight’

‘eighteen’

‘eighteen million’ xxx xxx

‘eighteen thousand’ xxx

‘eight hundred’

‘eight million’

‘eight thousand’

‘eighty’

‘eighty eight’

. . .

point is the eighteens are in the middle of the eights, because the word eight exists in eighteen, so depending on the 2nd word, eighteen could come before it, like eight hundred comes after eighteen, but eight comes before eighteen. .. this scenario also occurs with nine/nineteen/ninety, seven/seventeen/seventy, six/sixty/sixteen

so we need to look ahead 1 word when grouping, the word in question and one word ahead will be sufficient to group all numbers with the leading word alphabetically.

so lets look at the first few numbers in the list of words:

note: [‘six’, ‘’] means ‘six’ as the first word, and ‘’ (nothing) as the second, indicating the word is finished… so our countpossibles([‘six’,’’]) won’t count six hundred, six thousands  or all the other numbers starting with ‘six’.

0  letters in all words starting with  [”] 51000000000  letters are left to traverse
5  letters in all words starting with  [‘eight’, ”] 50999999995  letters are left to traverse
8  letters in all words starting with  [‘eighteen’, ”] 50999999987  letters are left to traverse
59872000  letters in all words starting with  [‘eighteen’, ‘million’] 50940127987  letters are left to traverse
34440  letters in all words starting with  [‘eighteen’, ‘thousand’] 50940093547  letters are left to traverse
7245900054  letters in all words starting with  [‘eight’, ‘hundred’] 43694193493  letters are left to traverse
. . .

61908450  letters in all words starting with  [‘seventy’, ‘six’] 4391707697  letters are left to traverse
33440  letters in all words starting with  [‘seventy’, ‘thousand’] 4391674257  letters are left to traverse
63910452  letters in all words starting with  [‘seventy’, ‘three’] 4327763805  letters are left to traverse
61908450  letters in all words starting with  [‘seventy’, ‘two’] 4265855355  letters are left to traverse
3  letters in all words starting with  [‘six’, ”] 4265855352  letters are left to traverse
7045699854  letters in all words starting with  [‘six’, ‘hundred’] -2779844502  letters are left to traverse
so far we have: [‘six’]
there are 4265855355 chars left to traverse

we finally get to a point where ‘six’ ‘hundred’ stepped over the 51 billion we are counting, so the number/word in question must begin with six… although really we can say the 2nd word will be ‘hundred’ because hundred isnt a word that exists as the beginning of another word, like six, eight, seven or nine, but for simplicity of the algorithm play dumb and continue with our heads down.

so we add back on all the letters we counted starting with ‘six’, 3+7045699854 and we again have a positive number of letters left to traverse, 4265855355.

lets repeat the same process appending possible following words to ‘six’

3  letters in all words starting with  [‘six’, ”] 4265855352  letters are left to traverse
10  letters in all words starting with  [‘six’, ‘hundred’, ”] 4265855342  letters are left to traverse
66913455  letters in all words starting with  [‘six’, ‘hundred’, ‘eight’] 4198941887  letters are left to traverse
69916458  letters in all words starting with  [‘six’, ‘hundred’, ‘eighteen’] 4129025429  letters are left to traverse
. . .

715180596  letters in all words starting with  [‘six’, ‘hundred’, ‘ninety’] 752050874  letters are left to traverse
64911453  letters in all words starting with  [‘six’, ‘hundred’, ‘one’] 687139421  letters are left to traverse
66913455  letters in all words starting with  [‘six’, ‘hundred’, ‘seven’] 620225966  letters are left to traverse
70917459  letters in all words starting with  [‘six’, ‘hundred’, ‘seventeen’] 549308507  letters are left to traverse
725190606  letters in all words starting with  [‘six’, ‘hundred’, ‘seventy’] -175882099  letters are left to traverse
so far we have: [‘six’, ‘hundred’]
there are 4265855352 chars left to traverse

now as expected we over shoot the number of letters we are counting with hundred as the 2nd word.

lets find the third:

10  letters in all words starting with  [‘six’, ‘hundred’, ”] 4265855342  letters are left to traverse
15  letters in all words starting with  [‘six’, ‘hundred’, ‘eight’, ”] 4265855327  letters are left to traverse
18  letters in all words starting with  [‘six’, ‘hundred’, ‘eighteen’, ”] 4265855309  letters are left to traverse
69872000  letters in all words starting with  [‘six’, ‘hundred’, ‘eighteen’, ‘million’] 4195983309  letters are left to traverse
44440  letters in all words starting with  [‘six’, ‘hundred’, ‘eighteen’, ‘thousand’] 4195938869  letters are left to traverse
66872000  letters in all words starting with  [‘six’, ‘hundred’, ‘eight’, ‘million’] 4129066869  letters are left to traverse
41440  letters in all words starting with  [‘six’, ‘hundred’, ‘eight’, ‘thousand’] 4129025429  letters are left to traverse
16  letters in all words starting with  [‘six’, ‘hundred’, ‘eighty’, ”] 4129025413  letters are left to traverse
72919461  letters in all words starting with  [‘six’, ‘hundred’, ‘eighty’, ‘eight’] 4056105952  letters are left to traverse
. . .
72919461  letters in all words starting with  [‘six’, ‘hundred’, ‘seventy’, ‘nine’] 187757645  letters are left to traverse
71918460  letters in all words starting with  [‘six’, ‘hundred’, ‘seventy’, ‘one’] 115839185  letters are left to traverse
73920462  letters in all words starting with  [‘six’, ‘hundred’, ‘seventy’, ‘seven’] 41918723  letters are left to traverse
71918460  letters in all words starting with  [‘six’, ‘hundred’, ‘seventy’, ‘six’] -29999737  letters are left to traverse
so far we have: [‘six’, ‘hundred’, ‘seventy’]
there are 549308507 chars left to traverse

and so on until we pinpoint the exact number the 51st billionth character exists in.

notice that each time we do this, we learn 1 word of the desired number , and there are at most 14 words in any integer less than a billion, so it’s not that bad.

This program takes about 90ms on my computer with python to find the 51st billionth character, but also will find the character at any other index in this sorted and concatenated string with 100% accuracy.

Again, I may have glossed over a couple details, so here’s the code:

"""
"If the integers from 1 to 999,999,999 are written as words,
sorted alphabetically, and concatenated,
what is the 51 billionth letter?"


this code will actually find ANY character in this with string 100% accuracy

"""


word=""
words=[]
find = 51*10**9
left=find
while left >= 1:
    print("so far we have:",words)
    print("there are",left,"chars left to traverse")
    print("")
    next2words=[]
    county=[]
    for next1 in classy.nextwords(words):
        county.append(0)
        if next1!="":
            for next2 in classy.nextwords(words+[next1]):
                next2words.append([next1+next2,next1,next2])
    next2words.append(['','',''])
    next2words.sort()

    for n in range(0,len(next2words)-1):
        if next2words[n][1]=='':
            add=['']
        else:
            add=[next2words[n][1]]+[next2words[n][2]]
        tempsum=classy.countpossibles(words+add)

        if left>tempsum:
            left-=tempsum
            county[classy.nextwords(words).index(next2words[n][1])]+=tempsum

        elif tempsum>=left:
            left+=county[classy.nextwords(words).index(next2words[n][1])]
            words.append(next2words[n][1])
            word=word+next2words[n][1]


            if len(word)>=left:
                print(words)
                print("done!, the ",find,"st/th character is ", word[left-1])
                print(word[0:left-1]+'*'+word[left:len(word)])
                left-=tempsum
            break




The code is for readability more than speed, could be heavily optimized, but still O( (log(n))^3 ) where n is the largest integer~=number of integers in the problem.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: