Slow Replace performance with regex

I have a CSV file containing 46k rows. In one of the columns I need to extract all the characters that sit between [ and ]. I achieve this with the regex [^]]+(?![^[]*])

Data example:
This is text like lorem ipsum [this text I like to extract] and everything else can be ignored [except this part]. Thank you
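
In plain JavaScript terms the intent looks like this (just a sketch to show the expected output; the [ and ] need escaping in a JS regex literal):

var row = "This is text like lorem ipsum [this text I like to extract] and everything else can be ignored [except this part]. Thank you";
// Delete every run of characters that is not inside a [] pair.
var kept = row.replace(/[^\]]+(?![^\[]*\])/g, "");
// kept is now "[this text I like to extract][except this part]"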

The Replace transform works; however, it takes 20 minutes to complete, and during this time I can't use EDT. The RAM allocation is already set to 16GB.

Is there any way to speed this up? Is there a better transformation approach?

@zappatero Obviously a regex replace is going to be significantly slower than a plain text replace. However, 20 minutes for 46k rows sounds very slow. Can you email us your input dataset and .transform file so we can investigate? If the data is too sensitive to email us, then please just send the first ~10 rows with any sensitive information modified or removed.

I made a dataset with 50k rows of:

This is text like lorem ipsum [this text I like to extract] and everything else can be ignored [except this part]. Thank you

When I ran it through the Replace transform on an old laptop, it did it in 29 seconds. So I am not sure why it is so slow in your case.

Can you see from the progress bar that it is spending all its time in the Replace?

Hey
Thanks for your fast reply. I will send you the files.
Yes, it spent 1500+ seconds on this Replace alone.
Working on an M1 Mac.

Best,
A

I’ve got the data thanks and will take a look. Hope to get you an answer within 24 hours.

I had a quick look. I'm not a regex expert, but I think it may be the regex itself.

With your regex my Windows PC takes ~6 minutes. If I swap it for this simplified regex:

\[(.*)\]

which greedily matches everything from the first [ to the last ], it only takes 1.17 seconds to do 46k rows!
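
A quick sketch in plain JavaScript shows the difference:

var row = "This is text like lorem ipsum [this text I like to extract] and everything else can be ignored [except this part]. Thank you";
var m = row.match(/\[(.*)\]/);
// m[1] is "this text I like to extract] and everything else can be ignored [except this part"
// i.e. one greedy span, not the separate [] pairs.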

Any regex experts care to comment?

Other approaches may be possible. Is there a maximum number of [] pairs that a cell can contain?

Thanks for your support!
I'm not a regex expert either. But after quite a long time, this was the only solution I found.
Yes, the simplified \[(.*)\] pattern is very fast, but it's not what I need :slight_smile:

The maximum number of [] pairs per cell is currently 6.

Here is an alternative solution using the JavaScript transform on column 3 to create a new column. I believe it does exactly what you want. It takes 0.985 seconds on your 46k dataset on my development PC.

// Copy everything between (and including) [] pairs from column 3.
var s = $(3);
var inBrackets = false;
var out = "";
// Walk the string one character at a time, tracking whether we are
// currently inside a [] pair.
for (var i = 0; i < s.length; i++ )
{
   if ( inBrackets && s[i] === ']' )
   {
      // End of a bracketed run.
      out += ']';
      inBrackets = false;
   }
   else if ( !inBrackets && s[i] === '[' )
   {
      // Start of a bracketed run.
      out += '[';
      inBrackets = true;
   }
   else if ( inBrackets )
   {
      // Inside a bracketed run: keep the character.
      out += s[i];
   }
   // Characters outside brackets are dropped.
}
// Close an unterminated trailing [ so the output is well formed.
if ( inBrackets )
{
   out += ']';
}
return out;

It should work with an unlimited number of [] pairs, as long as they aren't nested (one inside the other).

No doubt it is also possible to do this much faster in regex. But I'm not expert enough with regex to know how. :blush:
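
One guess, as a sketch only (not benchmarked): match the bracketed runs directly instead of deleting everything around them, e.g. in the JavaScript transform:

var s = $(3);
// [^\]]* cannot cross a ], so there is no backtracking blow-up.
var runs = s.match(/\[[^\]]*\]/g);   // null if the row has no [] pairs
// Unlike the loop above, an unterminated trailing [ run is dropped.
return runs ? runs.join("") : "";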

I have tackled the regex again and the best solution I found was: [^\]](?![^\[]*\])
Maybe there is an even better pattern, but this reduced the processing time from ~23 minutes to 19 seconds :star_struck:
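
I think the speed-up comes from dropping the +: the engine now tests one character at a time, so there is no greedy run to backtrack through on every failed lookahead. A quick sanity check in plain JavaScript (sketch only) gives the same output as before:

var row = "This is text like lorem ipsum [this text I like to extract] and everything else can be ignored [except this part]. Thank you";
// Remove every character outside the [] pairs, one at a time.
console.log(row.replace(/[^\]](?![^\[]*\])/g, ""));
// "[this text I like to extract][except this part]"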

But nothing beats the JavaScript transform, which does the job in ~2 seconds! :ok_hand:

Thanks for your great support. I hope others can profit from these insights as well.

Glad that helped. Did the forum mangle the regex in the above comment?

If so try using the preformatted text option (</>) above.

Yes… the regex was mangled. Corrected.
