# [Solved] How to read specific rows of CSV file with fread function

I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by a binary vector `V`

of length 10 million, which assumes value `0`

if I don’t want to read the row and `1`

if I do want to read the row.

How do I get the io function `fread`

from the `data.table`

package to do this? I ask because `fread`

is so so fast compared to all other io approaches.

The best solution this question, Reading specific rows of large matrix data file, gives the following solution:

`read.csv( pipe( paste0("sed -n '" , paste0( c( 1 , which( V == 1 ) + 1 ) , collapse = "p; " ) , "p' C:/Data/target.csv" , collapse = "" ) ) , head=TRUE)`

where `C:/Data/target.csv`

is the large CSV file and `V`

is the vector of `0`

or `1`

.

However I have noticed that this is orders of magnitude slower than simply using `fread`

on the entire matrix, even if the `V`

will only be equal to `1`

for a small subset of the total number of rows.

Thus, since `fread`

on the whole matrix will dominate the above solution, how do I combine `fread`

(and specifically `fread`

) with row sampling?

This is not a duplicate because it is only about the function `fread`

.

Here’s my problem setup:

```
#create csv
csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
#my csv has a header:
colnames(csv) <- LETTERS[1:5]
#save csv
write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
#create vector of 0s and 1s that I want to read the CSV from
read_vec <- rep(0,50)
read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
#the following is the effect that I want, but I want an efficient approach to it:
csv <- read.csv("/home/user/test_csv.csv") #inefficient!
csv <- csv[which(read_vec==1),] #inefficient!
#the alternative approach, too slow when scaled up!
csv <- fread( pipe( paste0("sed -n '" , paste0( c( 1 , which( read_vec == 1 ) + 1 ) , collapse = "p; " ) , "p' /home/user/test_csv.csv" , collapse = "" ) ) , head=TRUE)
#the fastest approach yet still not optimal because it needs to read all rows
require(data.table)
csv <- data.matrix(fread('/home/user/test_csv.csv'))
csv <- csv[which(read_vec==1),]
```

##
Solution #1:

This approach takes a vector `v`

(corresponding to your `read_vec`

), identifies sequences of rows to read, feeds those to sequential calls to `fread(...)`

, and `rbinds`

the result together.

If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., `c(1:50, 55, 70, 100:500, 700:1500)`

) then there will be few calls to `fread(...)`

and you may see a significant improvement.

```
# create sample dataset
set.seed(1)
m <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)
seq <- rle(v)
idx <- c(0, cumsum(seq$lengths))[which(seq$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=seq$length[which(seq$values)])
library(data.table)
result <- do.call(rbind,apply(indx,1, function(x) return(fread("test.csv",nrows=x[2],skip=x[1]))))
```