Julie

What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string? The files are unindexed and unsorted and, for the purposes
of my immediate requirements, can't be indexed or sorted.

I don't want to load the entire file into physical memory; memory-mapped files
are OK (and preferred).

Speed/performance is a requirement: the target is to locate the string in
10 seconds or less for a 100 MB file. The search string is typically
10 characters or less.

Finally, I don't want to spawn an external executable (e.g. grep), but include
the algorithm/method directly in

Hermit Dave

I would suggest that you have a look at the Regex implementation. I think Regex
is the fastest when it comes to scanning.

You might need to use a FileStream to load the file, so I don't think it's the
most appropriate answer.

Anyway, make a local copy of one of those files and give Regex a try. See if
it comes anywhere near the 10-second mark.

--
Regards,
Hermit Dave
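A trial along the lines Hermit Dave suggests could look like the minimal sketch below. It streams the file line by line through a StreamReader instead of loading it whole, and times the scan with Stopwatch. The names RegexTrial, FirstMatchingLine and Time are hypothetical, chosen only for illustration, and the sketch assumes a match never spans a line break.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper for timing a Regex scan of a large text file.
static class RegexTrial
{
    // Returns the 1-based number of the first line containing needle, or -1.
    public static long FirstMatchingLine(string path, string needle)
    {
        // Regex.Escape turns the needle into a literal pattern.
        var pattern = new Regex(Regex.Escape(needle), RegexOptions.Compiled);
        long lineNumber = 0;
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lineNumber++;
                if (pattern.IsMatch(line)) return lineNumber;
            }
        }
        return -1;
    }

    // Runs the search once and reports how long it took.
    public static TimeSpan Time(string path, string needle, out long line)
    {
        var watch = Stopwatch.StartNew();
        line = FirstMatchingLine(path, needle);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Calling RegexTrial.Time on a local copy of one of the data files gives an elapsed time to compare against the 10-second target. The first run may be dominated by disk reads, so it is worth timing a second run as well.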

"Julie" <ju***@nospam.c om> wrote in message

I wouldn't spend any more time seeing *if* you can do this, until you find
out *why* you have to do this!

100 MB flat file?? This is exactly the reason why relational databases were
made and are still used for just about everything. Without knowing more
about your app, I'd rather take the 2 minutes to load this into a SQL table
and build an index. Then what you want to do suddenly becomes quick
(sub-second) and simple, and will support wildcards later. Maybe bulk-load your
file at night and have your front end hit the database during the day?

I don't think you will be happy with just about any solution. Every response
you will get to this is either going to be way too slow -or- way too
complicated. You're re-inventing the wheel!!

My $.02

"Julie" <ju***@nospam.com> wrote in message

Nov 16 '05 #3

cody

Just load the text file into a large string and use string.IndexOf(); this
should be even faster than Regex.

--
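Since the original question rules out loading the whole file into memory, one way to try the IndexOf idea is to apply it block by block. The sketch below is only an illustration under stated assumptions: the class name ChunkedIndexOf, the method Find and the default block size are hypothetical, and the returned offset counts characters as decoded by StreamReader, not bytes. It carries the last needle.Length - 1 characters of each block into the next one so that a match straddling a block boundary is still found.

```csharp
using System;
using System.IO;

// Hypothetical helper: ordinal string search over a file, one block at a time.
static class ChunkedIndexOf
{
    // Returns the character offset of the first occurrence of needle, or -1.
    public static long Find(string path, string needle, int blockSize = 1 << 20)
    {
        if (string.IsNullOrEmpty(needle))
            throw new ArgumentException("needle must be non-empty");

        int overlap = needle.Length - 1;           // chars carried between blocks
        var buffer = new char[blockSize + overlap];
        int carried = 0;                           // chars kept from the previous block
        long baseOffset = 0;                       // file offset (in chars) of buffer[0]

        using (var reader = new StreamReader(path))
        {
            int read;
            while ((read = reader.Read(buffer, carried, blockSize)) > 0)
            {
                int valid = carried + read;
                int hit = new string(buffer, 0, valid)
                    .IndexOf(needle, StringComparison.Ordinal);
                if (hit >= 0) return baseOffset + hit;

                // Keep the tail so a match spanning two blocks is not missed.
                int keep = Math.Min(overlap, valid);
                Array.Copy(buffer, valid - keep, buffer, 0, keep);
                baseOffset += valid - keep;
                carried = keep;
            }
        }
        return -1;
    }
}
```

StringComparison.Ordinal is used because the default culture-sensitive IndexOf is considerably slower and is not needed for an exact match. Memory use stays at roughly one block regardless of file size.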

"Julie" <ju***@nospam.c om> schrieb im Newsbeitrag

Nov 16 '05 #4

John Timney (Microsoft MVP)

Given the size of the file, probably the only way would be to use a
file byte array: load the file in as bytes 1 at a time and convert them to
chars by creating an indexer. You will need to work out a way of checking

I would start here.
http://msdn.microsoft.com/library/de...us/csref/html/vcwlkindexerstutorial.asp

However, I very much doubt you will manage to scan 100 meg of data in 10
seconds.

--

Julie

> I wouldn't spend any more time seeing *if* you can do this, until you find
> out *why* you have to do this!

All requirements have been defined at this point by project management. This
isn't just a blind decision, but the result of the examination of the domain
and expected results.

> 100 MB flat file?? This is exactly the reason why relational databases were
> made and are still used for just about everything. Without knowing more
> about your app, I'd rather take the 2 minutes to load this into a SQL table
> and build an index. Then what you want to do suddenly becomes quick
> (sub-second) and simple, and will support wildcards later. Maybe bulk-load your
> file at night and have your front end hit the database during the day?

Yes, 100+ MB flat files. These are loosely formatted data files from external
laboratory instruments.

Remember, proper implementation dictates that you implement what is required,
nothing more. The current requirements are simple access to and management of
these text files that allows immediate searching that averages 10 seconds or
less. WinGrep accomplishes this in 6 seconds on our target system (1.3 GHz,
500 MB RAM).

Future requirements *may* dictate that additional time constraints are imposed,
which would then lead to a database or other external indexing of the files.
But that will be implemented when and if necessary. If you have questions about
this approach, you may want to look into the (industrial) extreme programming
paradigm, which is what our shop successfully uses.

> I don't think you will be happy with just about any solution. Every response
> you will get to this is either going to be way too slow -or- way too
> complicated. You're re-inventing the wheel!!

Nay, my good friend, not re-inventing the wheel, but asking where the wheel
is. Text searching of large files isn't uncommon or inappropriate. I'm just
looking for comments on such searches in .NET; this stuff is fairly trivial in
C++/Win32 (I'd prefer *not* to drop down to managed/unmanaged C++ for this
project).

"Julie" <ju***@nospam.com> wrote in message

Nov 16 '05 #6

Julie

"John Timney (Microsoft MVP)" wrote:

> Given the size of the file, probably the only way would be to use a
> file byte array: load the file in as bytes 1 at a time and convert them to
> chars by creating an indexer. You will need to work out a way of checking
>
> I would start here.
> http://msdn.microsoft.com/library/de...us/csref/html/vcwlkindexerstutorial.asp
>
> However, I very much doubt you will manage to scan 100 meg of data in 10
> seconds.

Thanks, I'll look into that.

WinGrep performs the search in 6 seconds on our target system (1.3 GHz, 500 MB
RAM). (WinGrep is not open source, and is C++/Win32.)

--

Nov 16 '05 #7

Frans Bouma [C# MVP]

Julie wrote:

> Nay, my good friend, not re-inventing the wheel, but asking where the wheel
> is. Text searching of large files isn't uncommon or inappropriate. I'm
> just looking for comments on such searches in .NET; this stuff is fairly
> trivial in C++/Win32 (I'd prefer not to drop down to managed/unmanaged C++
> for this project).

Searching text in text blocks should use the text-search algorithm by
Knuth-Morris-Pratt or the Boyer-Moore variant. These algorithms are much
faster than the brute-force algorithms implemented in the string class.

Algorithms in C by Sedgewick contains a description of these algorithms, and
I'm sure you'll find some descriptions on the internet.

Basically they come down to this:

string: ababababcababababacababababacbabababacbabababc

If you now try to find the string abc, do not start comparing at the first
character of the pattern, but at the last. In the string, the 3rd character is
an 'a', not a 'c', so abc cannot start at the first character of the string.
Because 'a' does occur in the pattern, we can shift the pattern forward until
its 'a' lines up with that character (2 positions); had the character not
occurred in the pattern at all, we could have skipped all 3.

It works with skip arrays and is quite clever; it will tremendously speed up
string search, especially with large texts.

Frans.
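The skip-array idea can be sketched as follows. Note that this is the Horspool simplification of Boyer-Moore (bad-character shifts only), not the full Boyer-Moore algorithm, and it works on bytes on the assumption that the data files use a single-byte encoding. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper: Boyer-Moore-Horspool search over a byte buffer.
static class Horspool
{
    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int IndexOf(byte[] text, byte[] pattern)
    {
        int m = pattern.Length, n = text.Length;
        if (m == 0) return 0;

        // skip[b] = how far the window may move when its last byte is b.
        var skip = new int[256];
        for (int i = 0; i < 256; i++) skip[i] = m;
        for (int i = 0; i < m - 1; i++) skip[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= n - m)
        {
            // Compare from the last byte of the pattern backwards.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Shift by the skip value of the byte under the window's last position.
            pos += skip[text[pos + m - 1]];
        }
        return -1;
    }
}
```

For example, Horspool.IndexOf(Encoding.ASCII.GetBytes("ababababcababab"), Encoding.ASCII.GetBytes("abc")) returns 6. With a pattern this short the shifts are small; the benefit grows with longer search strings, because bytes that do not occur in the pattern let the window jump by the full pattern length.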

--
Microsoft C# MVP

Nov 16 '05 #8

James Curran

Julie wrote:

> What is the *fastest* way in .NET to search large on-disk text files
> (100+ MB) for a given string.

> I don't want to load the entire file into physical memory,
> memory-mapped files are ok (and preferred).

The problem is that fast access to a large file requires direct access
to memory, which is antithetical to managed code. Your best choice would be
to isolate the search in an unmanaged C++ function, which is called by the
managed C# app. Accessing the file as a memory-mapped file is fairly easy in
unmanaged code, but I don't believe it's possible at all in a managed app.

The best algorithm for searching your file is probably the Boyer-Moore
method. Moore himself has a cool webpage graphically demonstrating it:

Truth,
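For readers on later .NET releases: the System.IO.MemoryMappedFiles namespace (added in .NET Framework 4.0, after this thread was written) makes memory-mapped access available from managed code. The sketch below is a deliberately simple illustration, not a tuned implementation. The names MappedSearch and Find are hypothetical, the text is assumed to be in a single-byte encoding, and the scan is brute force to keep the example short.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Hypothetical helper: brute-force byte search over a memory-mapped file.
static class MappedSearch
{
    // Returns the byte offset of the first occurrence of needle, or -1.
    public static long Find(string path, string needle)
    {
        byte[] pattern = Encoding.ASCII.GetBytes(needle); // assumes single-byte text
        long length = new FileInfo(path).Length;
        if (pattern.Length == 0 || length < pattern.Length) return -1;

        // Map the file read-only; capacity 0 means "use the file's size".
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            long last = length - pattern.Length;
            for (long pos = 0; pos <= last; pos++)
            {
                int j = 0;
                while (j < pattern.Length && view.ReadByte(pos + j) == pattern[j]) j++;
                if (j == pattern.Length) return pos;
            }
        }
        return -1;
    }
}
```

Reading one byte at a time through the accessor is slow because every call is bounds-checked. A faster variant would copy blocks out with ReadArray and run a skip-table search such as Boyer-Moore over each block. In a 32-bit process, mapping a very large file as a single view may also fail for lack of address space, in which case smaller views can be mapped in turn.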

William Stacey, MVP

