Programming
python algorithm search text-files binary-search
Updated Fri, 26 Aug 2022 21:16:39 GMT

Binary Search in large .txt with python (ordered by hash)


I would like to search a very large text file in which SHA1 Hashes are sorted by hash values using Python. The text file has 10GB and 500 000 000 lines. Each line looks like this:

000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27

I compare thereby whether a given hash value occurs in the file. I tried it with BinarySearch, but it only works with a small test file. If I use the file with 10GB the search takes much too long and the process is killed sometime because 16GB RAM was exceeded.

f=open('testfile.txt', 'r')
text=f.readlines()
data=text
#print data
x = '000009F0DA8DA60DFCC93B39F3DD51A088ED3FD9:27'
def binarySearch(data, l, r, x):
  while l <= r:
    mid = l + (r - l)/2;
    # Check if x is present at mid
    if data[mid] == x:
        return mid
    # If x is greater, ignore left half
    elif data[mid] < x:
        l = mid + 1
        #print l
    # If x is smaller, ignore right half
    else:
        r = mid - 1
        #print r
# If we reach here, then the element
# was not present
  return -1
result = binarySearch(data,0, len(data)-1, x)
if result != -1:
   print "Element is present at index % d" % result
else:
   print "Element is not present in array"

Is there a way to load the 10GB text file once into RAM and access it over and over again? I have 16GB RAM available. That should be enough, right? Is there anything else I could do to speed up the search? Unfortunately I don't know any more.




Solution

Take your sample input as input.txt as below

000000005AD76BD555C1D6D771DE417A4B87E4B4:4
00000000A8DAE4228F821FB418F59826079BF368:3
00000000DD7F2A1C68A35673713783CA390C9E93:630
00000001E225B908BAC31C56DB04D892E47536E0:5
00000006BAB7FC3113AA73DE3589630FC08218E7:2
00000008CD1806EB7B9B46A8F87690B2AC16F617:4
0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15
0000000A1D4B746FAA3FD526FF6D5BC8052FDB38:16
0000000CAEF405439D57847A8657218C618160B2:15
0000000FC1C08E6454BED24F463EA2129E254D43:40

And remove the counts so your file becomes (in.txt below ):

000000005AD76BD555C1D6D771DE417A4B87E4B4
00000000A8DAE4228F821FB418F59826079BF368
00000000DD7F2A1C68A35673713783CA390C9E93
00000001E225B908BAC31C56DB04D892E47536E0
00000006BAB7FC3113AA73DE3589630FC08218E7
00000008CD1806EB7B9B46A8F87690B2AC16F617
0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8
0000000A1D4B746FAA3FD526FF6D5BC8052FDB38
0000000CAEF405439D57847A8657218C618160B2
0000000FC1C08E6454BED24F463EA2129E254D43

This will ensure you have fixed size for each entry.

Now you can use mmap based file reading approach as in here https://docs.python.org/3/library/mmap.html

import mmap
import os
FIELD_SIZE=40+1  # also include newline separator
def binarySearch(mm, l, r, x):
    while l <= r:
        mid = int(l + (r - l)/2);
        # Check if x is present at mid
        mid_slice = mm[mid*FIELD_SIZE:(mid+1)*FIELD_SIZE]
        mid_slice = mid_slice.decode('utf-8').strip()
        # print(mid_slice)
        if mid_slice == x:
            return mid
        # If x is greater, ignore left half
        elif mid_slice < x:
            l = mid + 1
            #print l
        # If x is smaller, ignore right half
        else:
            r = mid - 1
            #print r
    # If we reach here, then the element
    # was not present
    return -1
# text=f.readlines()
# data=text
#print data
x = '0000000CAEF405439D57847A8657218C618160B2'
with open('in.txt', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    f.seek(0, os.SEEK_END)
    size = f.tell()
    result = binarySearch(mm, 0, size/FIELD_SIZE, x)
    if result != -1:
        print("Element is present at index % d" % result)
    else:
        print("Element is not present in array")

OUTPUT:

$ python3 find.py 
Element is present at index  8

Since the file is not read completely in memory, there won't be out of memory errors.





Comments (5)

  • +0 – thank you very much! It works with a very good performance. I want to get it to run via CGI now. Unfortunately this does not work. Do you still have an idea what it could be? — Sep 28, 2019 at 07:47  
  • +0 – to be more concrete: IOError: [Errno 13] Permission denied is the Error. The Cgi Script has the correct rights (chmod 775). it looks like no access to the text file is possible — Sep 28, 2019 at 08:00  
  • +0 – EDIT: if i call it in the cgi directory with python file.py, i normally get the desired result. But if I call the whole thing via CGI, I get the permission error — Sep 28, 2019 at 10:09  
  • +0 – your file must be accessible by the user running web-server — Sep 28, 2019 at 10:19  
  • +0 – your right..Has to ask the administrator for that. let you know on monday if it worked :) — Sep 28, 2019 at 10:27  


External Links

External links referenced by this document: