I have this quite big CSV file (15 GB) and I need to read about 1 million random lines from it.
As far as I can see (and have implemented), the csv utility in Python only allows iterating sequentially over the file.
It's very memory-consuming to read the whole file into memory just to pick lines at random, and very time-consuming to go through the whole file discarding some values and choosing others. Is there any way to choose a random line from the CSV file and read only that line?
I tried without success:
```python
import csv

with open('linear_e_LAN2A_F_0_435keV.csv') as file:
    reader = csv.reader(file)
    print reader[someRandomInteger]
```
A sample of the CSV file:
```
331.093,329.735
251.188,249.994
374.468,373.782
295.643,295.159
83.9058,0
380.709,116.221
352.238,351.891
183.809,182.615
257.277,201.302
61.4598,40.7106
```
```python
import random

filesize = 1500                 # size of the really big file
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  # go to random position
f.readline()                    # discard - bound to be partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead
```
As @AndreBoos pointed out, this approach leads to biased selection: the chance of landing on a given line is proportional to the length of the line before it. If you know the minimum and maximum line lengths, you can remove this bias by doing the following.

Let's assume (in this case) we have min = 3 and max = 15.

1) Find the length (Lp) of the previous line.

If Lp = 3, the line is the most biased against, so we should take it 100% of the time. If Lp = 15, the line is the most biased towards; since it is 5× more likely to be selected, we should take it only 20% of the time.

We accomplish this by randomly keeping the line X% of the time, where:

X = min / Lp

If we don't keep the line, we make another random pick until the dice roll comes good. :-)
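A minimal sketch of this rejection-sampling idea (assuming Python 3 and a file that ends with a newline; the variant below selects the line containing the random byte and scans back to its start, so the length of the biasing line is directly observable; `min_len` is the known minimum line length):

```python
import os
import random

def random_line_unbiased(path, min_len):
    """Pick one line uniformly at random via seek plus rejection sampling.

    A random byte lands in a line with probability proportional to that
    line's length; keeping it with probability min_len / len(line)
    cancels the bias, making every line equally likely.
    """
    filesize = os.path.getsize(path)
    with open(path, 'rb') as f:
        while True:
            pos = random.randrange(filesize)
            # scan backwards to the start of the line containing pos
            while pos > 0:
                f.seek(pos - 1)
                if f.read(1) == b'\n':
                    break
                pos -= 1
            f.seek(pos)
            line = f.readline()
            # rejection step: keep with probability min_len / len(line)
            if random.random() < min_len / len(line):
                return line.decode()
```

Each attempt reads only a handful of bytes, so this stays fast even on a 15 GB file; the expected number of retries is bounded by max / min.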
> I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it
Assuming you don't need exactly 1 million lines and know the number of lines in your CSV file beforehand, you can use reservoir sampling to retrieve your random subset. Simply iterate through your data and, for each line, determine the chance of that line being selected. That way you only need a single pass over your data.
This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you’ll only need to keep track of the number of entries each time the dataset changes).
```python
from random import random

chances_selected = desired_num_results / total_entries

result = []
for line in csv.reader(file):
    if random() < chances_selected:
        result.append(line)
```
You can use a variation of the probabilistic method for choosing a random line in a file. Instead of keeping just a single chosen line, you keep a buffer of size C. For each line number n in a file with N lines, you choose that line with probability C/n (rather than the original 1/n). If the line is selected, you then choose a random location in the C-length buffer to evict.
Here’s how it works:
```python
import random

C = 2
fpath = "somelines.txt"
buffer = []

f = open(fpath, 'r')
for line_num, line in enumerate(f):
    n = line_num + 1.0
    r = random.random()
    if n <= C:
        buffer.append(line.strip())
    elif r < C / n:
        loc = random.randint(0, C - 1)  # pick a buffer slot to evict
        buffer[loc] = line.strip()
```
This requires a single pass through the file (so it's linear time) and returns exactly C lines from the file. Each line has probability C/N of being selected.
To verify that the above works, I created a file with 5 lines containing a,b,c,d,e. I ran the code 10,000 times with C=2. This should produce about an even distribution of the 5 choose 2 (so 10) possible choices. The results:
```
a,b: 1046    b,c: 1018    b,e: 1014    a,c: 1003    c,d: 1002
d,e: 1000    c,e:  993    a,e:  992    a,d:  985    b,d:  947
```
If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don't mind scanning through the huge file once (without loading it into memory), then you can create a list of line indices and use seek to quickly grab the lines (based on Maria Zverina's answer).
```python
import random

# Overhead:
# Read the line locations into memory once. (If the lines are long,
# this should take substantially less memory than the file itself.)
fname = "big_file"
s = [0]
linelocs = [s.append(s[0] + len(n)) or s.pop(0) for n in open(fname)]
f = open(fname)  # Reopen the file.

# Each subsequent iteration uses only the code below:
# Grab a 1,000,000 line sample.
# I sorted these because I assume the seeks are faster that way.
chosen = sorted(random.sample(linelocs, 1000000))
sampleLines = []
for offset in chosen:
    f.seek(offset)
    sampleLines.append(f.readline())

# Now we can randomize if need be.
random.shuffle(sampleLines)
```
If the lines are truly .csv format and NOT fixed field, then no, there’s not. You can crawl through the file once, indexing the byte offsets for each line, then when later needed only use the index set, but there’s no way to a priori predict the exact location of the line-terminating \n character for arbitrary csv files.
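A sketch of that index-then-seek approach (file and function names here are illustrative; the index costs one sequential pass, after which each sampled line is a single seek):

```python
import random

def build_line_index(path):
    """One pass over the file, recording the byte offset of each line start."""
    offsets = []
    with open(path, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_random_lines(path, offsets, k):
    """Seek directly to k randomly chosen line starts."""
    chosen = sorted(random.sample(offsets, k))  # sorted seeks are kinder to disk
    lines = []
    with open(path, 'rb') as f:
        for off in chosen:
            f.seek(off)
            lines.append(f.readline().decode())
    return lines
```

The index holds one integer per line, so for a 15 GB file with ~100-byte lines it needs memory for roughly 150 million integers rather than the whole file.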
Another solution is possible if you know the total number of lines n: generate 1 million random line numbers up to that total as a set, then use:

```python
lines_to_grab = set(random.sample(xrange(n), 1000000))

for i, line in enumerate(csvfile):
    if i in lines_to_grab:
        yield line
```

This will get you exactly 1 million lines in an unbiased way, but you need to know the number of lines beforehand.
If you can place this data in a sqlite3 database, selecting some number of random rows is trivial. You will not need to pre-read or pad lines in the file. Since sqlite data files are binary, your data file will be 1/3 to 1/2 smaller than the CSV text.
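Loading the CSV into sqlite3 takes one pass (a sketch assuming a two-column file like the sample above; the table and column names are illustrative):

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path):
    """Bulk-load a two-column CSV into a sqlite3 table named 'csv'."""
    con = sqlite3.connect(db_path)
    with con:  # the transaction commits on successful exit
        con.execute("CREATE TABLE IF NOT EXISTS csv (x REAL, y REAL)")
        with open(csv_path, newline='') as f:
            con.executemany("INSERT INTO csv VALUES (?, ?)", csv.reader(f))
    con.close()
```

`executemany` wraps the inserts in a single transaction, which matters for a file with hundreds of millions of rows.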
Then use these statements to get 1,000,000 random rows:
```python
import sqlite3

mydb = 'csv.db'
con = sqlite3.connect(mydb)

with con:
    cur = con.cursor()
    cur.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 1000000;")
    for row in cur.fetchall():
        pass  # now you have random rows...
```
You can rewrite the file with fixed-length records, and then perform random access on the intermediate file later:
```python
ifile = open("inputfile.csv")
ofile = open("intermediatefile.csv", 'w')

for line in ifile:
    # pad each record to 15 characters plus the newline (16 bytes per record)
    ofile.write(line.rstrip('\n').ljust(15) + '\n')
```
Then, you can do:
```python
import random

record_len = 16  # 15 padded characters plus the newline
ifile = open("intermediatefile.csv")
lines = []
samples = random.sample(range(nlines), nsamples)  # nlines rows total, nsamples wanted

for sample in samples:
    ifile.seek(sample * record_len)  # jump straight to the chosen record
    lines.append(ifile.readline())
```
Requires more disk space, and the first program may take some time to run, but it allows unlimited later random access to records with the second.
```python
# pass 1, count the number of rows in the file
rowcount = sum(1 for line in file)

# pass 2, select random lines
file.seek(0)
remaining = 1000000
for row in csv.reader(file):
    if random.randrange(rowcount) < remaining:
        print row
        remaining -= 1
    rowcount -= 1
```
In this method, we generate a set of random numbers whose count equals the number of lines to be read, with values ranging up to the number of rows in the data. The set is then sorted from smallest to largest.

The CSV file is then read line by line, with a line_counter tracking the row number. The line_counter is compared with the first element of the sorted list of random numbers; if they match, that line is written to the new CSV file, the first element is removed from the list, the previously second element takes its place, and the cycle continues.
```python
import csv
import random

k = random.sample(xrange(No_of_rows_in_data), No_of_lines_to_be_read)
Num = sorted(k)

with open(input_file, 'rb') as file_handle:
    reader = csv.reader(file_handle)
    with open(output_file, 'wb') as outfile:
        a = csv.writer(outfile)
        for line_counter, line in enumerate(reader):
            if line_counter == Num[0]:
                a.writerow(line)
                Num.pop(0)          # drop the matched row number
                if len(Num) == 0:
                    break
```
```python
import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 1000000
batch_size = 5000

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
```
For more details, please see the other answer.
```python
import random

def random_line(path, hint=1):
    with open(path, mode="rb") as file:
        # seek to 2 bytes before EOF; seek() returns that offset, which is
        # the largest position we may jump to (skipping the final newline)
        max_pos = file.seek(-2, 2)
        while True:
            file.seek(random.randrange(max_pos))    # jump somewhere random
            # read at most `hint` bytes; if the chunk ends with a newline,
            # the file pointer now sits at the start of a full line
            if file.readline(hint).endswith(b'\n'):
                break
        return file.readline().decode().strip()
```
This is what I wrote for reading a random line from a very large file.
The time complexity is O(k), where k is the average length of the lines in the text file.
The hint argument is the minimal length of the lines in the text file; if you know it beforehand, use it to speed up the function.
Always works for me:

```python
import csv
import random

randomINT = set(random.sample(range(1, 72655), 40000))  # a set makes membership tests fast

with open("file.csv", "rU") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"', dialect=csv.excel_tab)
    data_read = [row for idx, row in enumerate(reader) if idx in randomINT]

for idx, line in enumerate(data_read):
    pass  # process each sampled line here
```