Size limits for a CSV file

Hi:

I have a rather large CSV file downloaded from a government server. It is just shy of 200 columns and over 8 million rows/records. File size is approximately 8 GB. I keep getting error messages (out of memory) and crashes when attempting to load this into EDT. I have increased the maximum memory usage to 20 GB to no avail. Am I exceeding EDT’s limitations?

Windows 10
1 TB SSD (600 GB free)
16 GB RAM
i7

Jon

Is it running out of memory when adding the input, or when adding transforms?

Easy Data Transform keeps everything in memory, for speed. It tries to be clever with the memory using reference counting. How much memory is needed depends on how much data is in each cell and how many unique values there are in each column. But 8m rows x 200 cols is pushing it, especially if you start adding various transforms, which will use up more memory.
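To give a rough idea of why the number of unique values matters, here is a minimal sketch of that kind of value sharing (illustrative only, not the actual EDT code): repeated cell values are stored once and referenced, so a column with few distinct values costs far less memory than one where every cell is different.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch: a simple string pool so duplicate cell values
// share one allocation instead of being stored once per cell.
class ValuePool {
public:
    std::shared_ptr<const std::string> intern(const std::string& value) {
        auto it = pool_.find(value);
        if (it != pool_.end())
            return it->second;                        // reuse existing value
        auto shared = std::make_shared<const std::string>(value);
        pool_.emplace(value, shared);
        return shared;
    }
private:
    std::unordered_map<std::string, std::shared_ptr<const std::string>> pool_;
};

int main() {
    ValuePool pool;
    // A column of 8 million "Yes"/"No" cells needs only two string
    // allocations plus one pointer per cell under this scheme.
    std::vector<std::shared_ptr<const std::string>> column;
    column.push_back(pool.intern("Yes"));
    column.push_back(pool.intern("No"));
    column.push_back(pool.intern("Yes"));             // shares the first "Yes"
    std::cout << (column[0] == column[2]) << "\n";    // prints 1: same object
}
```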

You could try setting the maximum memory significantly higher. But EDT will start to slow down if it has to use virtual memory (paging from RAM to disk).

Is it possible to split the file into several smaller sections first?
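If splitting is an option, something along these lines would cut a big CSV into chunks while repeating the header row in each piece (a rough sketch of my own, not EDT functionality, and it assumes no quoted fields contain embedded newlines):

```cpp
#include <fstream>
#include <string>

// Illustrative sketch: split big.csv (assumed file name) into chunks of
// 1 million data rows, repeating the header line in each chunk.
// Note: a naive line-based split like this mis-handles quoted fields
// that contain embedded newlines.
int main() {
    std::ifstream in("big.csv");
    std::string header, line;
    std::getline(in, header);

    const long long rowsPerChunk = 1000000;
    long long row = 0;
    int chunk = 0;
    std::ofstream out;

    while (std::getline(in, line)) {
        if (row % rowsPerChunk == 0) {
            if (out.is_open())
                out.close();
            out.open("chunk_" + std::to_string(chunk++) + ".csv");
            out << header << "\n";
        }
        out << line << "\n";
        ++row;
    }
}
```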

If you can send me a copy of the file, I can investigate further. I recommend using swisstransfer.com or similar.

I loaded a 4 GB (31 million rows x 15 cols) CSV file in EDT v1.27.0. It took 88% of the 10 GB of memory allowed. There are quite a lot of repeated data values in this CSV (which can be compressed in memory).

No transforms, just loading the file. The file is HIPAA protected so I cannot share this particular one. But I have another I can share with very similar specifications. It fails too when loading but I was able to achieve what I needed using a text editor (of all things!).

Sent to your support email. Thanks for investigating.

Jon

I haven’t received an email from you at our support address.

Re-sent via my FilesAnywhere account.

Jon

Thanks. I will take a look when I get a chance.

OK, so it turns out that Easy Data Transform used 32-bit signed integers in part of the CSV file parsing. Your file has more than 2^31 characters, so the integer overflows. Oops. I’m testing a fix now.
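To illustrate the kind of bug (this isn’t the actual EDT source): a character count or offset stored in a signed 32-bit integer silently wraps once the file passes 2^31 - 1 characters (about 2.1 GB), whereas a 64-bit integer holds the real value.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Illustrative only: an 8 GB file has more characters than a signed
    // 32-bit integer can represent (max 2,147,483,647).
    std::int64_t fileChars = 8000000000LL;               // ~8 GB of text

    // Storing that count in a 32-bit signed int silently wraps to a
    // negative value on typical platforms.
    std::int32_t badOffset = static_cast<std::int32_t>(fileChars);
    std::cout << "32-bit offset: " << badOffset << "\n";

    // A 64-bit offset holds the real position.
    std::int64_t goodOffset = fileChars;
    std::cout << "64-bit offset: " << goodOffset << "\n"; // 8000000000
}
```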

@JonPolish
I have fixed the issue and was able to load your 8 GB CSV file.

It takes about 15 minutes to load a file this huge into Easy Data Transform, and it takes up around 8 GB of RAM (up to 12 GB during processing). I tried it with some dedicated CSV editors and they struggled as well.

Wow, that was fast. I see this update is not yet live on the site. You want me to test this first?

One text editor that deals with these large files fast and very well is EM Editor. It’s the text editor I hinted at above, in case you want to take a look.

Jon

Oh, and thank you Andy.

Jon

It is a snapshot release, which means that it hasn’t been fully tested and is just released for customer feedback.

Yes, please.

The snapshot Easy Data Transform opens your 330 cols x 7 million rows 8 GB CSV file in ~14 minutes on my PC.

I tried a few editors for comparison on the same PC:

CSV editors:

  • csvEd - silently truncated everything after the first 112k rows!
  • ModernCSV - opened in read-only mode in 3 minutes. Ran out of memory in edit mode.
  • Ron’s CSV editor - after 30 minutes it had read in 5m rows, but slowed to a crawl.

Text editors:

  • Notepad: Refused to open it.
  • Notepad++: Refused to open it.

Tried it with the file I could not share with you. It works and you are right - it is slow. Took almost 20 minutes to load. But the good news is that it does work.

I find EDT valuable for manipulating data from these files. Most are much smaller, but I will occasionally need to process larger files.

You should check out EM Editor. It is fast, light and does not experience the limitations you mentioned with the other text editors.

Jon

I had a look at EMEditor. I think they are using various tricks to read the CSV fast. For example, I believe they only load into memory the cells you can see. I’m not sure that would be a good approach for Easy Data Transform, but we may use multi-threading to speed up reading at some point.
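For what it’s worth, the usual trick for that kind of on-demand loading is to scan the file once, record the byte offset where each row starts, and then seek and parse only the rows currently on screen. A rough sketch of the idea (my guess at the general approach, not their actual code):

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Illustrative sketch of windowed loading: index row start offsets once,
// then read only the rows that are currently visible.
class RowIndex {
public:
    explicit RowIndex(const std::string& path) : path_(path) {
        std::ifstream in(path_, std::ios::binary);
        std::string line;
        std::streampos pos = in.tellg();
        while (std::getline(in, line)) {
            offsets_.push_back(pos);       // remember where each row starts
            pos = in.tellg();
        }
    }

    std::size_t rowCount() const { return offsets_.size(); }

    // Fetch one row on demand; only this row is parsed into memory.
    std::string row(std::size_t i) const {
        std::ifstream in(path_, std::ios::binary);
        in.seekg(offsets_.at(i));
        std::string line;
        std::getline(in, line);
        return line;
    }

private:
    std::string path_;
    std::vector<std::streampos> offsets_;
};

int main() {
    RowIndex index("big.csv");             // assumed file name
    if (index.rowCount() > 0)
        std::cout << index.row(0) << "\n"; // header row, read on demand
}
```

With 7 million rows the index itself is only tens of megabytes, but any transform that touches every cell would still have to stream the whole file, which is part of why I’m unsure it fits EDT’s model.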