Tuesday, December 13, 2011

Remove DTD declaration. Part II

In the previous post I showed a method that removes the DTD declaration from an XmlDocument object. It is a simple and nice solution unless there are a lot of big documents to process. The XmlDocument class takes a lot of memory - usually about 8-10 times more than the actual XML file. So if you have a 100 MB XML file, the XmlDocument object can use up to 1 GB of memory. I had to find another solution to remove the DTD from the document.
I asked a colleague for help - I knew he was a real regular expression specialist and could definitely help me. He came up with one simple method, which finds and removes everything between "<!" and ">" in the XML file (the DTD declaration and comments). It works on a String object with the Regex class instead of the heavy XmlDocument object.

Here is the code:
public static string RemoveDeclarations(string input)
{
    try
    {
        System.Text.RegularExpressions.Regex objRegExp = new System.Text.RegularExpressions.Regex("<![^>]+>");
        input = objRegExp.Replace(input, String.Empty);  // remove every "<!...>" sequence
    }
    catch { }  // on any failure, fall through and return the input unchanged
    return input;
}

Regular expression meaning:

<! - the match starts with <!;
[^>]+ - one or more characters other than >;
> - the match ends with the first > that follows;

All character sequences matching this pattern are replaced with empty strings. That's it!
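To make the effect of the pattern concrete, here is a small sketch that runs it against a sample document. The class name `DtdRegexDemo` and the sample XML are made up for this illustration; only the pattern itself comes from the method above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DtdRegexDemo
{
    // Same pattern as in the post: "<!", one or more non-">" characters, then ">".
    private static readonly Regex Declarations = new Regex("<![^>]+>");

    public static string RemoveDeclarations(string input)
    {
        return Declarations.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // Hypothetical sample input: XML declaration, DOCTYPE, a comment, and content.
        string xml =
            "<?xml version=\"1.0\"?>" +
            "<!DOCTYPE note SYSTEM \"note.dtd\">" +
            "<!-- a comment -->" +
            "<note><body>Hello</body></note>";

        // The "<?xml ...?>" declaration does not start with "<!", so it is kept.
        Console.WriteLine(RemoveDeclarations(xml));
    }
}
```

For this input the program prints `<?xml version="1.0"?><note><body>Hello</body></note>`. Two things to watch for with this pattern: a DOCTYPE with an internal subset (`<!DOCTYPE note [ ... ]>`) contains `>` characters inside the brackets, so the match stops early and leaves a stray `]>` behind, and CDATA sections also start with `<!`, so they are cut or removed together with the declarations.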

I've added a try...catch block so that messages are not suspended because of this method - if it fails, it returns the string unchanged. This is not right for every case - sometimes it is better to know where the problem is coming from.
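For the cases where you do want to know what went wrong, one possible variant is sketched below. It is only an illustration: the class `DeclarationCleaner` and the method `TryRemoveDeclarations` are hypothetical names, not part of the original code. Instead of swallowing the exception, it hands it back to the caller, who can log it or ignore it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DeclarationCleaner
{
    private static readonly Regex Declarations = new Regex("<![^>]+>");

    // Hypothetical variant: returns false and exposes the exception instead of hiding it.
    public static bool TryRemoveDeclarations(string input, out string output, out Exception error)
    {
        try
        {
            output = Declarations.Replace(input, string.Empty);
            error = null;
            return true;
        }
        catch (ArgumentException ex) // covers ArgumentNullException when input is null
        {
            output = input;          // same fallback as before: the unchanged input
            error = ex;
            return false;
        }
    }
}
```

The caller still gets the unchanged string on failure, so message processing can continue, but the reason for the failure is no longer lost.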
