I am trying to fetch data from an RSS feed (located at http://www.bgsvetionik.com/rss/) in a C# WinForms application. Take a look at the following code:

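// Requires: using System.IO; using System.Net; using System.Xml;
public static XmlDocument FromUri(string uri)
{
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.XmlResolver = null; // do not resolve external DTDs or entities

    // Dispose the client, the stream and the reader once the document is loaded.
    using (WebClient webClient = new WebClient())
    using (Stream rssStream = webClient.OpenRead(uri))
    using (XmlTextReader reader = new XmlTextReader(rssStream))
    {
        xmlDoc.Load(reader);
    }
    return xmlDoc;
}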

Although xmlDoc.InnerXml contains an XML declaration with UTF-8 encoding, I get &scaron; instead of š, and so on.

How can I solve it?

1 Answer

The feed's data is incorrect. The &scaron; is inside a CDATA section, so it isn't being treated as an entity by the XML parser.

If you look at the source XML, you'll find that there's a mixture of entities and "raw" characters, e.g. čišćenja in the middle of the first title.

If you need to correct that, you'll have to do it yourself with a Replace call - the XML parser is doing exactly what it's meant to.
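To see why the parser leaves the text alone, here is a minimal sketch that loads tiny hand-written fragments rather than the real feed. The class name `CdataDemo` and the method `RootText` are made up for illustration; only `XmlDocument` and its members come from the framework.

```csharp
using System.Xml;

public static class CdataDemo
{
    // Returns the text content of the root element of a small XML fragment.
    public static string RootText(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.XmlResolver = null; // no external DTD or entity resolution
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
}
```

Calling `CdataDemo.RootText("<title><![CDATA[&scaron;]]></title>")` gives back the literal text `&scaron;`, because everything inside a CDATA section is plain character data. By contrast, `CdataDemo.RootText("<title>&#353;</title>")` gives `š`, since a numeric character reference outside CDATA is expanded by the parser.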

EDIT: For the replacement, you could get hold of all the HTML entities and replace them one by one, or just find out which ones are actually being used. Then do:

string text = element.Value.Replace("&scaron;", "š")
                           .Replace(...);

Of course, this means that anything which is actually correctly escaped and should really be that text will get accidentally replaced... but such is the problem with broken data :(
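As an alternative to a hand-written chain of Replace calls, the framework's HTML decoder can do the "all the HTML entities" variant in one call. This is a sketch of a different technique from the one in the answer, and it makes some assumptions: .NET 4.0 or later (for `System.Net.WebUtility`), and a feed shaped like RSS 2.0 with `item/title` elements, which I haven't verified for this feed. `FeedText`, `DecodeEntities` and `DecodedTitles` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Xml;

public static class FeedText
{
    // Decodes HTML entity references (e.g. "&scaron;") that survived as
    // literal text, such as content that came out of a CDATA section.
    public static string DecodeEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    // Assumption: titles live at item/title, as in RSS 2.0.
    public static IEnumerable<string> DecodedTitles(XmlDocument feed)
    {
        foreach (XmlNode title in feed.SelectNodes("//item/title"))
        {
            yield return DecodeEntities(title.InnerText);
        }
    }
}
```

The same caveat applies as with Replace: the decoder also turns `&amp;` into `&` and so on, so any text that was correctly escaped and really meant to contain an entity-like string will be changed too.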

  • @Jon Skeet Great answer; you just beat me to it ;-). Pardon my doing a bit of SO.meta, here (we can remove these comments in a few minutes), but I'm wondering why you reply in community wiki. I'm new to SO and would like to know the difference / accepted practice in this area.
    – mjv
    Commented Sep 25, 2009 at 14:03
  • If it wasn't in a CDATA section, would it not simply error anyway, since XML has no idea what that entity refers to? As far as I was aware, XML only understands a very limited subset of the entities that work in HTML. It's not uncommon for RSS feeds to abuse the description element by including HTML content in it. Commented Sep 25, 2009 at 14:05
  • +1, it's what Hanselman calls "angle-bracket-delimited" data and not XML at all. BTW any reason why this is community wiki?
    – MarkJ
    Commented Sep 25, 2009 at 14:06
  • Thanks Jon! So, the only way to solve it is to make some Replacer() method which would replace all data from CDATA section?
    – Nikolan
    Commented Sep 25, 2009 at 14:31
  • @AnthonyWJones: I haven't checked whether the entity is being declared or not - but yes, I agree it's probably just a badly written feed. @mjv/MarkJ: I'm having a "rep holiday" until Monday, making all my posts CW. Pay no attention to that :)
    – Jon Skeet
    Commented Sep 25, 2009 at 14:53
