Extremely Large Single-Line File Parse

I am downloading data from a site and the site gives the data to me in very large blocks. Within the very large block, there are "chunks" that I need to parse individually. These "chunks" begin with "<ClinicalData>" and end with "</ClinicalData>". Therefore, an example string would look something like:

<ClinicalData>(ID="1")</ClinicalData><ClinicalData>(ID="2")</ClinicalData><ClinicalData>(ID="3")</ClinicalData><ClinicalData>(ID="4")</ClinicalData><ClinicalData>(ID="5")</ClinicalData>

Under "ideal" circumstances, the block is meant to be one-single line of data, however sometimes there are erroneous newline characters. Since I want to parse the (ClinicalData) chunks within the block, I want to make my data parse-able line-by-line. Therefore, I take the text file, read it all into a StringBuilder, remove new-lines (just in case), and then insert my own newlines, that way I can read line-by-line.

StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);

// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");

// set my own newline characters so the data becomes parse-able by line 
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");

// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());

This had been working out great (albeit maybe not efficiently, but at least it is friendly to me :)), until I encountered a block of data given to me as a 280 MB file.

Now I am getting a System.OutOfMemoryException with this block, and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280 MB of straight text. I have tried string splits, Regex.Match splits, and various other ways to break it into guaranteed <ClinicalData> chunks, but I keep getting the memory exception. I have also had no luck attempting to read pre-defined chunks (e.g., using ReadBytes).

Any suggestions on how to handle a 280 MB, potentially-but-might-not-actually-be single line of text would be great!



First off, I don't think you need to put all the text in a StringBuilder, since you aren't even concatenating parts to it. You could just try the following:

string text = File.ReadAllText(filepath).Replace("\n", "").Replace("<ClinicalData", "\n<ClinicalData");

Why not try a StreamReader for this task? You can pick a "chunk" size that you want to read in, and then split those chunks into the <ClinicalData>data</ClinicalData> parts. Here is some detailed code on how to do this:

char[] buffer = new char[1024];
string remainder = string.Empty;
List<ClinicalData> list = new List<ClinicalData>();

using (StreamReader reader = File.OpenText(@"source.txt"))
{
    int charsRead;
    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Only use the characters actually read; the last read will
        // usually fill less than the whole buffer.
        remainder = Parse(remainder + new string(buffer, 0, charsRead), list);
    }
}

with the following method:

string Parse(string value, List<ClinicalData> list)
{
    // Everything before the last "</ClinicalData>" is a set of complete
    // chunks; whatever follows it is an incomplete chunk that gets carried
    // over into the next buffer read. This also handles a closing tag that
    // is split across two reads.
    string[] parts = value.Split(new string[] { "</ClinicalData>" }, StringSplitOptions.None);
    for (int i = 0; i < parts.Length - 1; i++)
        list.Add(new ClinicalData(parts[i]));

    return parts[parts.Length - 1];
}

and the ClinicalData class, however you have it implemented:

class ClinicalData
{
    public ClinicalData(string value)
    {
        // fill in however you are already parsing out ID and other info
    }
}

There are many ways to implement something like this, but hopefully this can help get you started.

StreamReader's ReadLine() method is only one of the many ways you can read the text from the file. You can read into a buffer with a specified length, and then parse out the ClinicalData tags. I can provide an example if you'd like. http://msdn.microsoft.com/en-us/library/9kstw824%28v=vs.110%29.aspx

Alternatively, if you are reading an XML file, XmlReader is another option. http://msdn.microsoft.com/en-us/library/system.xml.xmlreader%28v=vs.110%29.aspx
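
For example, here is a minimal sketch of that approach, assuming each chunk is well-formed XML. Since the file is a series of top-level <ClinicalData> elements with no single root, the reader has to be created with ConformanceLevel.Fragment:

var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (XmlReader reader = XmlReader.Create(@"source.txt", settings))
{
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "ClinicalData")
        {
            // Pulls exactly one <ClinicalData>...</ClinicalData> element
            // into memory and leaves the reader positioned after it, so
            // memory use stays proportional to one chunk, not the file.
            string chunk = reader.ReadOuterXml();
            // process the chunk however you already do
        }
        else
        {
            reader.Read();
        }
    }
}

Note that the loop only calls Read() when the current node is not a ClinicalData element; ReadOuterXml already advances past the element it returns, so an unconditional Read() at the top of the loop would skip every other chunk.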


That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. And since you only need one character of lookahead, you only need to maintain a single piece of intermediate state, something like:

enum ReadState
{
    Start,   // nothing pending
    SawOpen  // the previous character was '<' and has not been written yet
}


using (var sr = new StreamReader(@"path\to\clinic.txt"))
using (var sw = new StreamWriter(@"path\to\output.txt"))
{
    var rs = ReadState.Start;
    while (true)
    {
        var r = sr.Read();
        if (r < 0)
        {
            // End of input: flush a pending '<' if we were still holding one.
            if (rs == ReadState.SawOpen)
                sw.Write('<');
            break;
        }

        char c = (char) r;

        // Drop any erroneous newline characters.
        if ((c == '\r') || (c == '\n'))
            continue;

        if (rs == ReadState.SawOpen)
        {
            // '<' followed by 'C' begins "<ClinicalData", so start a new
            // line; "</ClinicalData" has '/' next and stays on this line.
            if (c == 'C')
                sw.WriteLine();

            sw.Write('<');
            rs = ReadState.Start;
        }

        if (c == '<')
        {
            // Hold the '<' until we have seen the character after it.
            rs = ReadState.SawOpen;
            continue;
        }

        sw.Write(c);
    }
}
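
Once that pass is done, output.txt has one <ClinicalData> chunk per line, so the parse-by-lines step the question describes becomes straightforward. A minimal sketch, with the per-chunk parsing left as a placeholder:

using (var reader = new StreamReader(@"path\to\output.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Each non-empty line now holds exactly one chunk
        // starting with <ClinicalData.
        if (line.StartsWith("<ClinicalData"))
        {
            // parse the chunk's ID and other fields here
        }
    }
}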



