This repository was archived by the owner on Apr 10, 2025. It is now read-only.

Strip subresource hints #973

Closed
GoogleCodeExporter opened this issue Apr 6, 2015 · 30 comments

Comments

@GoogleCodeExporter

mod_pagespeed is not aware of subresource links, so it doesn't correct them.

Original issue reported on code.google.com by [email protected] on 18 Aug 2014 at 6:58

@jeffkaufman
Contributor

This is tricky. Imagine you have a page:

<link rel=subresource src=foo>
...
<img src=foo>

Then PageSpeed changes foo to foo.pagespeed.ic.HASH.jpg in the img but not in the subresource link. That's really bad, because it makes the subresource tag misleading to the browser and actively harmful: the browser preloads a URL the page no longer references.

Unfortunately we may not be able to correctly rewrite the subresource tag, because what URL we use depends on context. For example, we might have resized an image, or combined some resources.

To fix this we need to:

  • Remove any existing subresource tags/headers
  • Add new ones using the property cache when we think it's worth it (even if the page doesn't have them already)

We should do the first part now, because we're making things worse. The second one is substantial work, and is more of a new feature, but would still be good.

I'm opening #1158 for the feature, and keeping this for the bug.
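
For concreteness, a minimal sketch of what detecting the hints to strip could look like; the element representation below is a simplified stand-in, not the actual filter code:

```cpp
// Illustration only: a simplified stand-in for the parsed element; the real
// filter would operate on PageSpeed's HtmlElement/HtmlName types.
#include <cstring>

struct ElementSketch {
  const char* tag;  // e.g. "link"
  const char* rel;  // value of the rel attribute, or nullptr if absent
};

// True if this element is a subresource hint the filter would delete.
bool IsSubresourceHint(const ElementSketch& e) {
  return std::strcmp(e.tag, "link") == 0 &&
         e.rel != nullptr &&
         std::strcmp(e.rel, "subresource") == 0;
}
```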

@jeffkaufman jeffkaufman changed the title Support rel="subresource" Strip subresource hints Oct 24, 2015
@oschaaf
Member

oschaaf commented Oct 26, 2015

I'll take a look at stripping out any existing subresource tags/headers

@oschaaf oschaaf self-assigned this Oct 26, 2015
@jmarantz
Contributor

I think we should use the pcache for this. If we strip all subresources blindly we may remove ones that PageSpeed will not be rewriting. In fact I expect that would be a common case, because subresources referenced from js would benefit most from hints, since the preload scanner would miss those.

@jeffkaufman
Contributor

@jmarantz Yes, I agree we should use the property cache, and I've filed #1158 for that.

If people are mostly using subresource tags to reference things loaded by js then we should leave existing ones in for now, while if they're mostly referencing things loaded declaratively then we should remove them. I'll try to find out.

@jmarantz
Contributor

IMO there is not a strong reason to use subresource hints for resources referenced in HTML, as they can be preloaded. The only exception is when the HTML starts streaming through before (say) DB lookups are done.


@oschaaf oschaaf removed their assignment Oct 26, 2015
@jeffkaufman
Contributor

I just ran a search for examples of pages with rel=subresource. I found subresource links for resources that were loaded:

  • 12 pages: js referenced from inline js
  • 11 pages: js referenced directly from page
  • 5 pages: css referenced directly from page
  • 3 pages: image referenced from page
  • 2 pages: json referenced from js referenced from page
  • 2 pages: css referenced directly from page before the subresource hint
  • 2 pages: js not referenced from anywhere (one was old version of show_ads_impl)
  • 1 page: font referenced from css referenced from page
  • 1 page: font referenced from js referenced from page
  • 1 page: font not referenced from anywhere
  • 1 page: image (loading gif) referenced from js referenced from page
  • 1 page: swf not loaded by page
  • 1 page: css not referenced from anywhere (print stylesheet)
  • 1 page: image referenced from inline css before the subresource hint
  • 1 page: image referenced from inline js
  • 1 page: css referenced from inline js

(When multiple pages were generated from one template I only counted that once.)

First, a bunch of these (7) are just making the page worse and it would be great to strip them out:

  • 2 pages: css referenced directly from page before the subresource hint
  • 2 pages: js not referenced from anywhere (one was old version of show_ads_impl)
  • 1 page: swf not loaded by page
  • 1 page: image referenced from inline css before the subresource hint
  • 1 page: css not referenced from anywhere (print stylesheet)

Of the remaining 39, PageSpeed currently would rename the referenced resource (making the rel=subresource harmful) 21 times and leave it alone the other 18.

So there are cases where there are useful hints that we would strip, but stripping is better than not stripping right now.

@jeffkaufman
Contributor

If PreserveUrls is on we shouldn't strip, though.

@jeffkaufman
Contributor

Do we have consensus that we should strip subresource hints by default, but keep them when PreserveUrls is on?

@oschaaf
Member

oschaaf commented Nov 19, 2015

@jeffkaufman That makes sense to me.

@keesspoelstra
Contributor

Is it right to assume that we remove the link rel=subresource unless all preserve-URL settings (css/javascript/img) are on?

For the sake of performance: if we don't do lookups for the type of the resource in the link, we won't know beforehand which type a subresource has, so we have to delete the link. We can only keep the link if all preserve-URL settings (css/javascript/img) are on, which is the effect of PreserveUrls.
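
For what it's worth, a rough sketch of that check; the struct and field names below are assumptions modeled on the preserve-URL options, not the real RewriteOptions API:

```cpp
// Sketch only: these fields stand in for the css/js/image preserve-URL
// options; they are not the real RewriteOptions accessors.
struct PreserveSettings {
  bool css_preserve_urls;
  bool js_preserve_urls;
  bool image_preserve_urls;
};

// Without resolving the hint's resource type, a hint can only be kept when
// every preserve-URLs setting is on, which is the effect of PreserveUrls.
bool AllUrlsPreserved(const PreserveSettings& s) {
  return s.css_preserve_urls && s.js_preserve_urls && s.image_preserve_urls;
}
```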

@jmarantz
Contributor

jmarantz commented Dec 2, 2015

Yes. Also I think we should retain these links when their URLs are excluded due to the DomainLawyer or Allow/Disallow wildcards.


@jeffkaufman
Contributor

Agreed on both: keep them if PreserveUrls is fully on or they're excluded.
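
A minimal sketch of that combined rule, with assumed boolean inputs standing in for the real RewriteOptions and DomainLawyer checks:

```cpp
// Sketch of the combined decision; the parameters stand in for checks that
// would really go through RewriteOptions (preserve settings, Allow/Disallow)
// and the DomainLawyer.
bool ShouldStripHint(bool all_urls_preserved,   // PreserveUrls fully on?
                     bool url_disallowed,       // excluded by Allow/Disallow?
                     bool domain_authorized) {  // in scope for the DomainLawyer?
  if (all_urls_preserved) return false;  // nothing will be renamed: keep hints
  if (url_disallowed || !domain_authorized) {
    return false;  // this URL is out of scope for rewriting: keep its hint
  }
  return true;  // PageSpeed may rename the target, so the hint could go stale
}
```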

@keesspoelstra
Contributor

Do we want a setting to enable/disable this, and should it default to on?

@jeffkaufman
Contributor

Can we have a setting to turn this off, but default it to on?

If someone is using subresource hints to refer only to resources loaded from js, or loaded in other ways PageSpeed won't be able to detect, then they need a way to opt out of this behavior. But per the survey I did above, that's very few people.

@keesspoelstra
Contributor

OK, maybe it's an idea to make the first pull request without the setting, and a later second pull request with it.
I think we need some integration tests for this setting on the mod_pagespeed and ngx_pagespeed side?

@jeffkaufman
Contributor

We should have something in automatic/system_test.sh to verify that these actually are stripped. Having that run on a pair of directories, one with pagespeed PreserveSubresourceHints on and the other without, should be good?

@keesspoelstra
Contributor

What about side effects/behavior with RewriteLevel PassThrough?

@jeffkaufman
Contributor

RewriteLevel PassThrough

That's tricky. People expect that just setting things to PassThrough means we'll do nothing, but the current pull request doesn't do that.

I guess we could say "PassThrough means we don't strip hints, but then if you enable any other filters on top of that we do strip the hints"? But that seems like a big hack. Ideally we would only enable this if you turned on any filters that might change resource URLs, but I don't think we currently have a list of those, and maintaining one would be annoying.

Anyone have thoughts? Right now I'm inclined to go with "yes, this means we'll make changes with PassThrough on" which is sad.

@oschaaf
Member

oschaaf commented Dec 9, 2015

@jeffkaufman
I think that enabling hint stripping only when one or more (potentially) url-changing filters are enabled sounds nice.

Perhaps we could add a pure virtual method to HtmlFilter, e.g.:

virtual bool potentially_changes_urls() = 0;

Alternatively, a bitmask could be returned indicating which effects/prerequisites a filter has, to allow more fine-grained control in the pipeline (type of URLs changed/HTML element changes).
We could OR all these bitmasks together and have an efficient/speedy overview of what the pipeline would potentially do or need to function properly.

Going through all the filters to implement the new method could be some work, but if we do it like that, would maintenance still be annoying? To lower the amount of work we'd have to do initially for this, we could opt to not go for a pure virtual and return the "worst case" as a default.
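
A rough sketch of what that could look like; the flag names and the method below are illustrative, not the existing HtmlFilter API:

```cpp
// Sketch of the bitmask variant; names are illustrative only.
#include <vector>

enum BehaviorFlags {
  kNoBehaviors     = 0,
  kChangesUrls     = 1 << 0,  // the filter may rewrite resource URLs
  kChangesElements = 1 << 1,  // the filter may add or remove HTML elements
};

class FilterBehaviorSketch {
 public:
  virtual ~FilterBehaviorSketch() {}
  // A non-pure default returning the worst case keeps the initial change small.
  virtual int behavior_flags() const { return kChangesUrls | kChangesElements; }
};

// OR the flags of all enabled filters into one cheap summary of what the
// pipeline might do, e.g. whether stripping subresource hints is warranted.
int CombinedBehaviors(const std::vector<FilterBehaviorSketch*>& filters) {
  int combined = kNoBehaviors;
  for (const FilterBehaviorSketch* f : filters) {
    combined |= f->behavior_flags();
  }
  return combined;
}
```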

@jmarantz
Contributor

I like this plan. I think it's OK to give it a default implementation of 'return 0' so you don't have to change all the filters everywhere in one CL. Note that we have lots of fake filters in tests. Some filters are implemented in Google-private source only, so when we import a change with a new pure virtual method we'd have to get those too. Most filters don't change URLs, FWIW.

Note that a filter like rewrite_css doesn't change URLs if preserve_css_urls() is on, so this virtual method implementation needs to check the options.
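
For example, a rough sketch of an option-aware implementation (the class and flag are illustrative, not the real CSS filter code):

```cpp
// Sketch of the point above: a URL-rewriting filter's CanModifyUrls() should
// consult the options rather than unconditionally returning true.
class CssRewriteFilterSketch {
 public:
  explicit CssRewriteFilterSketch(bool preserve_css_urls)
      : preserve_css_urls_(preserve_css_urls) {}

  // Even a rewriting filter leaves URLs alone when preserve_css_urls is on.
  bool CanModifyUrls() const { return !preserve_css_urls_; }

 private:
  bool preserve_css_urls_;
};
```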


@keesspoelstra
Contributor

You'd probably need to return the full bitmask, or true, for the default case, because we need to account for the worst case.
"Don't know" = "might change", I think?

@jmarantz
Contributor

You're probably right -- defaulting to all 1s is safer. However it's too conservative, so pure virtual may be the way to go. This will probably affect dozens of _test.cc files with test filters. Maybe it's not worth having a bitmask; let's just stick with a bool for now and turn it into a bitmask in the future if that makes sense.

Then we can have:

  // EmptyHtmlFilter:
  virtual bool CanModifyUrls() { return false; }
  // RewriteFilter:
  virtual bool CanModifyUrls() { return true; }

Most filters are not RewriteFilters.


@jeffkaufman
Contributor

I like this.

@keesspoelstra
Contributor

@jeffkaufman @jmarantz What would be a good place in the rewrite driver to calculate the combined CanModifyUrls?

Will it be OK to mirror DetermineEnabledFilters and implement it in HtmlParse and RewriteDriver, or do we want to use the AddFilter / EnableRewriteFilter / PrependRewriteFilter / AppendRewriteFilter / AddUnownedPostRenderFilter / etc. functions to calculate the combined CanModifyUrls?
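
One possible shape for that, sketched with illustrative names rather than the actual HtmlParse/RewriteDriver code:

```cpp
// Sketch of computing the combined flag in one pass over the filters, loosely
// mirroring DetermineEnabledFilters; class and member names are illustrative.
#include <vector>

class FilterSketch {
 public:
  virtual ~FilterSketch() {}
  virtual bool CanModifyUrls() const { return false; }
};

class DriverSketch {
 public:
  void AddFilter(FilterSketch* filter) { filters_.push_back(filter); }

  // Walk the enabled filters once and remember whether any may rewrite URLs.
  void DetermineFiltersBehavior() {
    can_modify_urls_ = false;
    for (const FilterSketch* f : filters_) {
      can_modify_urls_ = can_modify_urls_ || f->CanModifyUrls();
    }
  }

  bool can_modify_urls() const { return can_modify_urls_; }

 private:
  std::vector<FilterSketch*> filters_;
  bool can_modify_urls_ = false;
};
```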

@jeffkaufman
Contributor

@morlovich

@jmarantz
Contributor

Btw, looking in this area of the code at the representation and API (HtmlElement and attribute values) is a good place to figure out what changes we need to support UTF-8 URL rewriting, which I think is close to the top of the accepted features queue.

Josh
On Dec 23, 2015 8:25 AM, "keesspoelstra" [email protected] wrote:

@jeffkaufman https://github.com/jeffkaufman
https://github.com/pagespeed/mod_pagespeed/pull/1204/files#r47135564

A NULL value here could also be a src attribute that has characters outside the normal encoding and will not decode, which is potentially dangerous if we encounter the same URL elsewhere in the document in url-encoded form.

We could do this:

  • no src attribute: keep the element
  • a src attribute with a value we do not understand (UTF-8 etc.): remove it



keesspoelstra added a commit to apache/incubator-pagespeed-ngx that referenced this issue Dec 24, 2015
jeffkaufman pushed a commit to apache/incubator-pagespeed-ngx that referenced this issue Mar 2, 2016
jeffkaufman pushed a commit to apache/incubator-pagespeed-ngx that referenced this issue Mar 4, 2016
@jeffkaufman
Contributor

Fixed with 54983f4 and b651c78

jeffkaufman pushed a commit to apache/incubator-pagespeed-ngx that referenced this issue Mar 7, 2016
jeffkaufman pushed a commit that referenced this issue Mar 7, 2016
Default behaviour is to strip subresource links which are in scope for pagespeed; these are the resources that are not disallowed or are on valid domains in the domain lawyer.

Added a can_modify_url flag to HtmlParse and a CanModifyUrl function to HtmlFilter, which indicates whether urls can be rewritten by the parser and thus whether the hints should be removed. This is tested now in strip_subresource_hints_filter_test.cc, as it is only used by the strip subresource hints feature right now. This should be moved to the HtmlParse tests.

DetermineEnabledFilters has been rolled up into DetermineFiltersBehaviour, which also determines can_modify_url for all the filters and possible future "behaviors".

Added a new option to explicitly prevent the default behaviour: ModPreserveSubresourceHints on/off

For Ubuntu, a check for the new document root /var/www/html instead of /var/www has been added.

Fixes Issue #973.

(Squash of 54983f4 and b651c78.)
@jeffkaufman
Contributor

Squashed as c6f0691 for backporting to 1.11.33.0.

jeffkaufman pushed a commit that referenced this issue Mar 9, 2016
jeffkaufman pushed a commit that referenced this issue Mar 9, 2016
jeffkaufman pushed a commit that referenced this issue Mar 9, 2016
jeffkaufman pushed a commit that referenced this issue Mar 11, 2016
jeffkaufman pushed a commit that referenced this issue Mar 11, 2016
jeffkaufman pushed a commit that referenced this issue Mar 11, 2016
jeffkaufman pushed a commit to apache/incubator-pagespeed-ngx that referenced this issue Mar 30, 2016
jeffkaufman pushed a commit to apache/incubator-pagespeed-ngx that referenced this issue Mar 30, 2016
@jeffkaufman
Contributor

Released in 1.11.33.0.

@jeffkaufman
Contributor

Two problems with this: #1393 and #1392.
