xml.dom.minidom does not escape CR, LF and TAB characters within attribute values #50002
Comments
Current behavior upon toxml() is: <foo attribute="multiline [...] Upon reading the document again, the new line is normalized and [...] Better behavior would be something like this (within attribute values only): <foo attribute="multiline&#10;value" /> |
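The report above can be checked with a short round trip. This is a minimal sketch, not part of the original report: the helper name roundtrip_attribute is hypothetical, and the result depends on whether the installed Python already contains the fix, so no fixed output is claimed.

```python
from xml.dom import minidom


def roundtrip_attribute(value):
    """Serialize an attribute value with toxml() and parse it back."""
    doc = minidom.parseString("<foo/>")
    doc.documentElement.setAttribute("attribute", value)
    serialized = doc.toxml()
    reparsed = minidom.parseString(serialized)
    return serialized, reparsed.documentElement.getAttribute("attribute")


serialized, recovered = roundtrip_attribute("multiline\nvalue")
print(serialized)
# The value survives only if the serializer wrote the newline as a
# character reference; otherwise the parser normalizes it to a space.
print(repr(recovered), recovered == "multiline\nvalue")
```

If the second print shows False, the newline was lost between toxml() and parseString(), which is the data loss described in this issue.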
Ok, I've tried to solve this problem, but I think that the keyword |
@francesco Sechi: Would it not just require a minimal change to the [...]

def _write_data(writer, data, is_attrib=False):
    "Writes datachars to writer."
    if is_attrib:
        data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    writer.write(data)

and in Element.writexml():

    #[...]
    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        _write_data(writer, attrs[a_name].value, True)
    #[...] |
Of course it should be:

def _write_data(writer, data, is_attrib=False):
    "Writes datachars to writer."
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    if is_attrib:
        data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    writer.write(data) |
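The two versions of _write_data in this exchange differ only in where the CR/LF replacement happens, and the order matters. The sketch below uses two hypothetical standalone functions (not minidom code) and assumes decimal character references for CR and LF. It shows that replacing whitespace first lets the later "&" replacement escape the freshly written references a second time.

```python
def escape_attr_wrong_order(data):
    # Whitespace first: the "&" of "&#10;" is then escaped again below.
    data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    return data.replace('"', "&quot;").replace(">", "&gt;")


def escape_attr_right_order(data):
    # Markup characters first, whitespace references last.
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace('"', "&quot;").replace(">", "&gt;")
    return data.replace("\r", "&#13;").replace("\n", "&#10;")


print(escape_attr_wrong_order("a\nb"))  # a&amp;#10;b
print(escape_attr_right_order("a\nb"))  # a&#10;b
```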
Don't worry, I'm a newcomer too. |
Hmm... I thought toxml() is the part that needs to be fixed, not the My point is: The toxml() (i.e. _write_data) *actually writes* the |
Attaching a patch that fixes the problem. |
Attaching a test file that outlines the problem. Output on my system Without the patch: With the patch: |
I think that the problem is: the xmldoc1 has the newline or not? If it |
Let me try to explain my opinion better:
So your patch works only in a specific case: you are trying to fix a |
Francesco, I think you are missing the point. :-) The problem has two sides. If I create an XML document using the DOM (not by parsing it from a However, *literal* newlines in an attribute value (i.e. when the The catch: This leads to an actual data loss if I *wanted* to store In other words - the parsing process you refer to is actually working Minidom is clearly missing functionality here, and it does not conform |
All right, now I understand, thanks. But I think that, for internal |
A solution for this issue could be to replace the setAttribute method as
NOTE: I didn't do a patch, because I don't know which Python version you Please try this solution and give me feedback, thanks. |
I have uploaded a test script that shows that, without my patch, the |
Francesco, I'm not sure whether the proposed behavior is correct or desirable. Even Here's a test case for trunk. |
My position is: |
Francesco,
I believe you still don't see the issue. The behaviour is not symmetric The point is that parseString() behaves correctly, but serializing does
It would be pointless to do the encoding in setAttribute(). The valid However, if parseString() encounters a '&#10;' in the input, it correctly |
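The asymmetry described here is easy to see on the parsing side alone. In this minimal sketch (attr_of is a hypothetical helper), a literal line feed inside an attribute value is normalized to a space, as XML attribute-value normalization requires, while a line feed written as a character reference is kept.

```python
from xml.dom import minidom


def attr_of(xml_text):
    """Return the 'a' attribute of the document element."""
    return minidom.parseString(xml_text).documentElement.getAttribute("a")


# Literal LF in the source text of the attribute value.
literal = attr_of('<foo a="multiline\nvalue"/>')
# LF written as a numeric character reference.
referenced = attr_of('<foo a="multiline&#10;value"/>')

print(repr(literal))     # 'multiline value'
print(repr(referenced))  # 'multiline\nvalue'
```

So a serializer that wants the newline to survive has to emit the character reference; the parser cannot recover it afterwards.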
Daniel Diniz: The proposed behaviour is correct: "In attribute values, the character information items Since the behaviour is correct, it is also desirable. :-) I don't think that this change could cause existing solution to break Thanks for putting up the unit test diff. |
I changed the patch to include support for TAB characters, which were [...] Also I switched the encoding of the newline from one numeric character reference form to another. This is |
See also a similar issue in etree: bpo-6492 |
@devon: Thanks for pointing & linking back here. |
Patched test_minidom and ran it; the test failed. I went to patch minidom.py and it appears to be up to date, so I have no idea why the test failed. Can someone please take a look, as it's 04:30 BST? Thanks. |
minidom.patch had the new file listed before the old, so I've uploaded minidom.diff. The patch is tiny and looks clean. Tests have been repeated on Windows against 2.7 and are fine, so I believe this can be committed. |
bpo-7139 has been closed as duplicate of this and contains a few messages that might be worth reading. |
As a workaround until the patch gets included, you can import this monkey patch module. |
And while we're at it, we should also .replace('&', '&amp;').replace('"', "&quot;").replace('<', '&lt;') which would have to go at the beginning to avoid double-escaping the '&'. We could use xml.sax.saxutils.escape to do all the escaping rather than chaining replaces: data = escape(data, {'"':'&quot;', '\r':'&#13;', '\n':'&#10;', '\t':'&#9;'}) which also escapes '>' (not strictly required for attribute values, but shouldn't be harmful either). |
I tried to apply the minidom.diff patch below, but it seems that removing the two lines that replace the "<" and ">" characters is not a good idea. At least the part with the tabs seems to work now and if I add the two lines with the replace calls that got deleted by the patch, everything seems fine. |
Just want to mention that until the patch gets included, it will be impossible to use the standard library to generate a working BCP (Bulk Copy Program) XML format file for SQL Server, which always requires a TERMINATOR="\r\n" or TERMINATOR=" " attribute. |
Just got bitten by this bug, which affects xml.etree.ElementTree and cElementTree too. Any chance to have it fixed? Note that lxml.etree is not affected by the bug and can be used as a replacement for the stdlib module if you happen to have some work to do. |
minidom may be broken, but what's the issue with ElementTree?

>>> import xml.etree.cElementTree as etree
>>> doc = etree.fromstring('<xml />')
>>> doc.set('attr', "multiline\nvalue")
>>> etree.tostring(doc)
'<xml attr="multiline&#10;value" />' |
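The session above uses xml.etree.cElementTree, which was removed in Python 3.9. A rough equivalent for current versions, offered as a sketch rather than a restatement of the commenter's session, uses xml.etree.ElementTree and asks tostring() for text instead of bytes.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<xml />")
doc.set("attr", "multiline\nvalue")

# encoding="unicode" makes tostring() return str rather than bytes.
serialized = ET.tostring(doc, encoding="unicode")
print(serialized)

# Parsing the serialized form back should give the original value.
roundtripped = ET.fromstring(serialized).get("attr")
print(repr(roundtripped))
```

This only covers the newline case discussed here; tab handling in ElementTree is a separate question and is not asserted by this sketch.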
ElementTree issue is with tabs:
bye bye tab. |
I was going to open an issue on itself about etree and tabs, but I've found this and thought it would have been marked as duplicate. If you don't think it's the case I will open it anyway. I've added my comment because ElementTree is not reported in this page at all: I've been googling forever about "ElementTree tabs attributes" without result. I've found this (and the network of duplicates) only searching for less generic "xml tabs" into the python bug tracker, when I was already filing the bug. |
Added separate issue bpo-17582 as ElementTree implementation doesn't depend on whatever causes the bug in xml.dom.minidom. |
I had to make an updated version of this monkey patch for a project I was working on. I leave it here for anyone's reference. The old instance method overwrite was bombing out with https://gist.github.com/jbaker6953/deb9a8e4eae1e622f467fc9b4edf11db |
Also double quotes (") are now only quoted in attributes.
Looks like this has been fixed by #107947. |