XML-SAX Invalid characters - error code 6 - Help please


  • XML-SAX Invalid characters - error code 6 - Help please

    Hi,

    We've been reading XML documents into our RPGLE parser, but every so often a string arrives (from an external B2B partner) that causes the XML parser to fail with error code 6, invalid characters in XML.

    The string snippet is ' canâÂ?Â?t '. It's the Â?Â?, which is Unicode U+0080 U+0099, that is causing the problem in XML-SAX.

    The characters seem to be UTF-8 so the XML seems valid.

    This string seems to have originally been " can't " written with a right single quote, which then went through a double-encoding problem: Windows-1252 text (perhaps copied from Outlook) was stored in a UTF-8 database via an application using an encoding other than UTF-8 or Windows-1252. Anyway, this isn't our problem to fix, so we cannot resolve it at the source. The problem is that these characters cause the XML-SAX parser to fail with error code 6, invalid characters in XML, even if I change the CCSID from the original 1252 to 1208 or any other CCSID.
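The suspected double encoding can be reproduced in a few lines. This is only a sketch in Python (not RPG), and it assumes the intermediate layer decoded the UTF-8 bytes one byte per character, as ISO-8859-1 does. That assumption matches the reported U+0080 and U+0099; a strict Windows-1252 decode would have produced U+20AC and U+2122 instead.

```python
# Reproduce the suspected double encoding of U+2019 (right single quotation mark).
original = "can\u2019t"

# Step 1: correct UTF-8 encoding, U+2019 becomes the bytes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Step 2 (assumption): some layer reads those bytes one byte per character,
# so each byte turns into the code point with the same numeric value.
misread = utf8_bytes.decode("latin-1")

# Step 3: the misread text is encoded as UTF-8 a second time.
double_encoded = misread.encode("utf-8")

print([f"U+{ord(c):04X}" for c in misread[3:6]])  # ['U+00E2', 'U+0080', 'U+0099']
print(double_encoded.hex(" "))                    # 63 61 6e c3 a2 c2 80 c2 99 74
```

The resulting bytes are still well-formed UTF-8, which would explain why the document looks valid while containing the control characters U+0080 and U+0099.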

    Is there a way to prevent these characters from causing this error? We cannot change the XML contents, although we could build in a layer that scans the XML document before feeding it to the parser and removes anything that is not in a list of friendly characters, but this seems like a botch.

    Can we change the CCSID to something that allows it to run OK, or is there something else I should do?

    Any help / comments would be appreciated.


  • #2
    Did you set the ccsid keyword in the %XML scalar function to the CCSID of your XML document, for example ccsid=1208 or ccsid=13488?
    Birgitta



    • #3
      Originally posted by B.Hauser
      Did you set the ccsid keyword in the %XML scalar function to the CCSID of your XML document, for example ccsid=1208 or ccsid=13488?
      Birgitta
      Actually, the value for the CCSID option is the CCSID of the data to be passed to the XML-SAX handler procedure. If there may be data in the XML file that cannot be converted to the job CCSID, code "ccsid=ucs2" in the XML-SAX options, and write the XML-SAX procedure to receive its data in UCS-2.



      • #4
        From a Double Post
        Originally posted by Scott Klement
        You say that the data is valid because it's UTF-8, but then later you say the CCSID is 1252. These are two different things. If the data is UTF-8, use the CHGATR command to set the CCSID to 1208.
        Sorry for the slow response, but I was not allowed to reply to my posts - but Jamie just fixed it for me.

        Scott, thanks for your response, but a simple XML document containing these characters fails regardless of what I set the CCSID to using CHGATR. Even if I change the CCSID to 1208, it seems that XML-SAX does not like the characters Â?Â?


        So, if I had a simple XML document like:

        <?xml version="1.0" encoding="utf-8"?>
        <Header>
        <Data>ThatÂ?Â?s right</Data>
        </Header>

        we get the following error:

        Message ID . . . . . . : RNX0351 Severity . . . . . . . : 50
        Message type . . . . . : Escape

        Message . . . . : The XML parser detected error code 6.
        Cause . . . . . : While parsing an XML document for an RPG procedure, the
        parser detected an error at offset xxx with reason code 6. The actual
        document is *N; *N indicates that the XML document was not an external file.
        Recovery . . . : Contact the person responsible for program maintenance to
        determine the cause of the problem.

        What I want to achieve is a way to read the XML file / data without XML-SAX issuing an error because of invalid data.
        When we receive data within an element, we check that it is valid data - but this combination causes the XML Parser to fail before returning the invalid data.

        As we do not know what other invalid character combinations may come through and upset XML-SAX, I envisage that if we receive these characters, we will have to scan through the XML data, check that each character is valid, copy only the valid ones into a new XML document, and write that back out. But this seems like overkill.

        Is there an easier way that allows XML-SAX to return the data and lets our RemoveDodgyChar function deal with it when it comes back as *XML_CHARS, rather than going to error?
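For illustration, the scan-and-copy layer described above might look something like the following. It is a sketch in Python rather than RPG, and `clean_xml_text` and its helpers are hypothetical names, not part of any existing program. It first tries to undo one round of double encoding on each run of U+0080..U+00FF characters, then drops any remaining control characters. Whether XML-SAX then accepts the result is not something this thread confirms.

```python
import re

# Runs of characters that could be UTF-8 bytes misread one byte per character.
_HIGH_RUN = re.compile(r"[\u0080-\u00ff]+")

# C0 controls other than tab, LF and CR, plus DEL and the C1 range U+0080..U+009F.
_SUSPECT = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def _repair_run(match):
    """Reinterpret a run as UTF-8 bytes; keep it unchanged if that fails."""
    run = match.group(0)
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        return run


def clean_xml_text(text):
    """Undo double encoding where possible, then strip leftover control characters."""
    repaired = _HIGH_RUN.sub(_repair_run, text)
    return _SUSPECT.sub("", repaired)


print(clean_xml_text("That\u00e2\u0080\u0099s right") == "That\u2019s right")  # True
```

Repairing run by run means a legitimate single accented character such as U+00E9 is left alone, because on its own it does not form a valid UTF-8 byte sequence.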

        Thanks for your help.



        • #5
          Not that this helps, but why not use the SQL XML functions? They handle this kind of stuff for you.
          Hunting down the future ms. Ex DeadManWalks. *certain restrictions apply



          • #6
            Originally posted by DeadManWalks
            Not that this helps, but why not use the SQL XML functions? They handle this kind of stuff for you.

            Thanks for the reply, but we already have a program with extensive mappings configured that works well in production until an invalid character appears, spoils the show, and requires effort to correct. So I'm looking for a resolution within XML-SAX.



            • #7
              As far as I can tell, the data should be U+2019, which is the right single quotation mark. In UTF-8 it is represented by the 3 bytes x'E28099'. All I can guess here is that it is somehow getting mistranslated, so that what is fed into XML-SAX is no longer that 3-byte sequence, but something else?

              My original guess was only that XML-SAX didn't know that the data was UTF-8. Now that I've looked further, it seems the data is no longer correct UTF-8, since I do see a 4th byte. Something isn't right here. You need to troubleshoot it to find out what's actually going wrong.
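One way to start that troubleshooting is to look at the raw bytes around the offset reported in RNX0351. The helper below is a hypothetical Python sketch; it assumes the reported offset is a byte offset into the document as it was handed to the parser, which the thread does not confirm.

```python
def hex_window(data: bytes, offset: int, radius: int = 8) -> str:
    """Return the bytes around `offset` as hex, prefixed with the window's start position."""
    start = max(0, offset - radius)
    chunk = data[start:offset + radius]
    return f"{start}: {chunk.hex(' ')}"


# Sample built from the bytes a double-encoded U+2019 would produce (an assumption).
sample = b"<Data>That\xc3\xa2\xc2\x80\xc2\x99s right</Data>"
offset = sample.index(b"\xc3\xa2")

print(hex_window(sample, offset, 4))  # 6: 54 68 61 74 c3 a2 c2 80
print(b"\xe2\x80\x99" in sample)      # False: the correct 3-byte sequence is absent
```

If the dump shows e2 80 99, the data is correct UTF-8 and the problem lies in how it is converted; anything else means the bytes were already damaged before parsing.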



              • #8
                Hi Scott, sorry for the really slow response. I just saw your comment and thought I'd update the thread.

                We ended up getting the company supplying the XML to strip out the characters that were causing the problem, as it was something in their ASP.NET customer service desk software that did this when email details were pasted in. It was a moot point, as the characters don't fail an XML validator check but do fail in XML-SAX. It was the only way to move forward easily.
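The validator behaviour is consistent with XML 1.0, whose Char production permits U+0080 through U+009F (most of that range is discouraged, but it is not forbidden). As a small check, Python's standard expat-based parser reads the sample document from earlier in the thread without complaint. The byte values assume the double encoding discussed above, and this says nothing about how the IBM i parser treats the same input.

```python
import xml.etree.ElementTree as ET

# The sample document, with the quote replaced by the assumed double-encoded bytes.
doc = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<Header>\n"
    b"<Data>That\xc3\xa2\xc2\x80\xc2\x99s right</Data>\n"
    b"</Header>\n"
)

root = ET.fromstring(doc)
text = root.find("Data").text

# The parser hands back the three characters unchanged, including the C1 controls.
print([f"U+{ord(c):04X}" for c in text[4:7]])  # ['U+00E2', 'U+0080', 'U+0099']
```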

                I investigated changing the CCSID in the IFS, but we never managed to get around this. We only had so much time to spend on the problem, so we basically ran out of ideas, gave up, and pushed the other end to make the change.
