Best unofficial Apache Server developers community
Aug 1, 2010
Florent ANDRE (JIRA)
Similar Threads
Created: (AVRO-556) Poor performance for Reader::readBytes can be easily improved
Poor performance for Reader::readBytes can be easily improved
DO NOT REPLY New: PATH_INFO normalization, especially relating to void path segments
https://issues.apache.org/bugzilla/show_bug.cgi?id=49396
Summary: PATH_INFO normalization, especially relating to void path segments
Product: Apache httpd-2
Version: 2.2.15
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: Core
AssignedTo: bu### @httpd.apache.org
ReportedBy: thei### @iinet.net.au
The PATH_INFO request variable is treated by httpd as a path, which is normalized to have dot segments or void path segments reduced (an empty path segment has traditionally, on UNIX, been treated as synonymous with a dot segment, i.e. /./ ). This is almost always the desired behavior, but is technically incorrect (the variable value itself, not how it is reduced), and can cause problems when a script/module cannot use PATH_INFO against REQUEST_URI. My proposed solution is to add a RAW_PATH_INFO variable, which contains the PATH_INFO portion of the REQUEST_URI as it appears in REQUEST_URI, undecoded and unresolved (i.e. as received on the Request Line).
The rest of this report is my rationale/testing and is probably superfluous and certainly badly edited for brevity, so please feel free to ignore it unless you think you need some background.
The following URL:
/index.html/1/2//3/./4/../5
has a PATH_INFO of:
/1/2/3/5
The removal of the dot segments is correct per RFC 3986, which doesn't recognize PATH_INFO other than as part of a path, and requires that dot segments be normalized irrespective of whether they are path components or opaque tokens (it's hierarchical so it is considered that it doesn't make a difference which type they are).
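The reduction described above can be sketched in a few lines. This is only an illustration of the observed behavior, not httpd's actual code; the function name `normalize_path` is hypothetical, and the sketch ignores details such as trailing slashes and percent-decoding.

```python
def normalize_path(path: str) -> str:
    """Reduce dot segments and void (empty) segments, as observed for PATH_INFO.

    Hypothetical helper for illustration only; not taken from httpd.
    """
    kept = []
    for segment in path.split("/"):
        if segment in ("", "."):
            # A void segment is treated like a single-dot segment: dropped.
            continue
        if segment == "..":
            # A double-dot segment removes the previous segment, if any.
            if kept:
                kept.pop()
            continue
        kept.append(segment)
    return "/" + "/".join(kept)


print(normalize_path("/1/2//3/./4/../5"))  # /1/2/3/5
```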
Note that most clients and/or intervening proxies will remove dot segments as part of their own resolution before they ever send the request to httpd.
So far, this is all correct behavior.
However, in the case of a void path segment (//), there is no normalization procedure defined as per RFC 3986 (or any of the others that deal with the subject - it's almost as if they're deliberately avoiding addressing it…).
So, a URL such as the following:
/index.html/http://example.com/index2.html
                 ^^
would have a PATH_INFO of:
http:/example.com/index2.html
     ^
And since there are fewer characters in PATH_INFO than there are in the PATH_INFO portion of REQUEST_URI, even after unencoding REQUEST_URI, it becomes extremely difficult to examine REQUEST_URI to determine the non-PATH_INFO portion of the path, or the original PATH_INFO.
Now, in this example, the slashes after http: are character data and not path separators, and so they should be encoded as %2F, but there is no way for the client to know to do this because it cannot differentiate between what is the PATH_INFO and what is the path - only the server knows this, and it only knows it when it decides what script to call. The author of the URL is at fault, but the script has to deal with it anyhow, just like any other invalid data. And while the script might just be able to throw back an HTTP 400 error (or other error of its choice), scripts that need the original URI (for example, for logging) without the PATH_INFO portion can't get it from REQUEST_URI (or anywhere else) even after normalization, because the normal procedure of simply removing (length PATH_INFO) characters from a normalized REQUEST_URI won't work if extra characters have been removed.
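A minimal sketch of that "normal procedure" shows where it goes wrong. The helper name `strip_path_info` is hypothetical, the REQUEST_URI is assumed to carry no query string, and the PATH_INFO values are written with their leading slash.

```python
from urllib.parse import unquote


def strip_path_info(request_uri: str, path_info: str) -> str:
    """Drop len(PATH_INFO) characters from the end of the decoded REQUEST_URI.

    Hypothetical helper; assumes request_uri has no query string.
    """
    decoded = unquote(request_uri)
    return decoded[: len(decoded) - len(path_info)]


# Works when nothing was removed from PATH_INFO during normalization:
print(strip_path_info("/index.html/1/2/3", "/1/2/3"))  # /index.html

# Off by one when a void segment was collapsed: the result keeps a stray "/".
print(strip_path_info(
    "/index.html/http://example.com/index2.html",
    "/http:/example.com/index2.html",
))  # /index.html/
```

The second result is wrong by exactly the number of characters that normalization removed, which the script has no direct way of knowing.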
(Not that the default httpd configuration would support such a PATH_INFO if it did have encoded slashes, but if you're expecting to deal with non-filesystem PATH_INFOs, it'd be up to you to know that you'd have AllowEncodedSlashes on.)
The only way that a script can recover the URL sans PATH_INFO is by comparing the end of an unencoded REQUEST_URI (as many characters from the right as there are in PATH_INFO) with the PATH_INFO. If they don't match, it must work backwards along REQUEST_URI looking for dot and void segments to add back into PATH_INFO until it matches (with special handling for segments at the very beginning of the PATH_INFO). Only then is what's left of REQUEST_URI the non-PATH_INFO portion of the URL, and the script must then apply its own segment resolution to PATH_INFO, without collapsing void paths, to get the PATH_INFO. (Even this is impossible if the last character of the script as given in REQUEST_URI is an unencoded period ".", which would be rare and silly, but not impossible.)
Certainly, I would agree that it's dumb to use the PATH_INFO for anything other than true files, as implied by RFC 3875 (you should use the Query string instead). The point is that even if you ARE using PATH_INFO only for normal files, when you do get certain kinds of requests (valid files or not), you can't isolate PATH_INFO from the REQUEST_URI. This realization came from the debugging of deliberately malformed URLs as a robustness test.
Changing the path resolution engine to not reduce void path segments in PATH_INFO means that special code must be written for the resolution of PATH_INFO (and it looks like a whole new subrequest, at least). Also, using a different resolution for PATH_INFO, from what is used for all other resolutions, will probably break almost every existing script in the universe that uses it, if they encounter such a URL, because while it is very unfortunate that a fundamental assumption of RFC 3875 is that all URLs implicitly map to files on a filesystem, that certainly is indeed by far the most common use case (or certainly was, at the time).
By far the easiest, most compatible way of dealing with this is to add a variable like RAW_PATH_INFO that doesn't feature path normalization or escape decoding; it's simply lopped off of the end of REQUEST_URI. Anyone who has never cared can continue to not care, and anyone else can easily get what they need.
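As a rough sketch of what such a variable would hold: the function `raw_path_info` and its `script_prefix_len` parameter are hypothetical, standing in for whatever the server knows about where the script's portion of the request path ends.

```python
def raw_path_info(request_uri: str, script_prefix_len: int) -> str:
    """Return the tail of the request path exactly as received.

    Hypothetical sketch: script_prefix_len is the number of characters of the
    raw request path that the server matched to the script. No decoding and
    no dot/void segment reduction is applied.
    """
    path = request_uri.split("?", 1)[0]  # set any query string aside
    return path[script_prefix_len:]


prefix = len("/index.html")
print(raw_path_info("/index.html/1/2//3/./4/../5", prefix))
# /1/2//3/./4/../5
print(raw_path_info("/index.html/http://example.com/index2.html?x=1", prefix))
# /http://example.com/index2.html
```

With a value like this available, a script could recover the non-PATH_INFO portion of the path by removing exactly `len(RAW_PATH_INFO)` characters from the end of the request path.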
I'm not really sure whether this constitutes a bug or a feature request. Strictly speaking, reducing void path segments is not required by URL-related specs, and is implicitly prohibited (that is, they MAY be significant, and you can't just remove significant data because of assumptions like that they represent a filesystem path). So, technically, the specific behavior of removing void path segments from PATH_INFO is a bug.
On the other hand, it IS the desired behavior; the PATH_INFO is specifically intended to represent a filesystem path. Almost every script/module ever written assumes that it will be a properly-formatted path (especially since RFC 3875 requires that it be unencoded). And the way that it is currently determined makes it very inefficient (or, complex) to fix.
Also, changing the resolving for PATH_INFO to preserve void segments will not entirely solve the discussed problem with it, because dot segments will still, correctly, be removed and the length of the PATH_INFO in the REQUEST_URI will remain as inscrutable as ever for such URLs.
So, adding the above-mentioned RAW_PATH_INFO would defer the argument over whether void path segments are significant, but that's nothing less than a naked feature request. So I classed it as a feature request.
Created: (HIVE-1483) Update AWS S3 log format deserializer regex
Update AWS S3 log format deserializer regex
Created: (FELIX-2380) [gogo] lock contention in piped writer when reader doesn't read all input
[gogo] lock contention in piped writer when reader doesn't read all input
Created: (MATH-374) [Enhancement] Request for public exception message strings on AbstractIntegerDis
[Enhancement] Request for public exception message strings on AbstractIntegerDistribution
Created: (TRANSACTION-40) Memory Leak in public InputStream readResource(Object resourceId)
Memory Leak in public InputStream readResource(Object resourceId)
Created: (CLK-676) ClickServlet should log all request parameter values
ClickServlet should log all request parameter values
Created: (CLK-685) Links should be able to restrict parameter binding for Ajax requests
Links should be able to restrict parameter binding for Ajax requests
Created: (CLI-206) If Parser.parse is called with a Properties parameter that contains an option not
If Parser.parse is called with a Properties parameter that contains an option not defined, get a NPE
Using Regex
All,
I am using pig embedded in Java and need to use matches in my pig job.
However when I try to use escape characters in the pig line, the
compiler complains. How do I use complex regex while embedding?
Sample code that is throwing errors:
myServer.registerQuery("filtered = FILTER firstcut BY dIP matches '\Q34.21.12.*\E';");
error: invalid escape sequence.
Thanks,
Matt
Using variables in regex
Well, how do I use the content of a variable in regex?
$username = "user1"
file { "userdata.tar.bz2":
  source => "puppet://$server/modules/$module/userdata.tar.bz2",
  ensure => $users ? {
    /$username/ => absent,
    default => present,
  },
}
$users is a custom fact that contains all local users:
users => at avahi bin daemon dnsmasq ftp games haldaemon lp mail
messagebus nobody ntp polkituser postfix pulse root sshd suse uuidd
wwwrun man news uucp puppet user1
When I hardcode "user1" into the regex my test works fine and the file
is removed.
But things like /$variable/ or /\$variable/ or /#{variable}/ just
don't work.
Is it even possible in version 0.25.4?
Ask a question about regex in CRS
Hi, everyone
The following rule comes from
rules/base_rules/modsecurity_crs_41_sql_injection_attacks.conf , but I
don't understand what the regular expression "(?:[\\\(\)\%#]|--)"
means. What's the meaning of "\%" in a regex?
SecRule MATCHED_VAR "(?:[\\\(\)\%#]|--)"
"t:none,setvar:'tx.msg=%{rule.msg}',setvar:tx.sql_injection_score=+%{tx.critical_anomaly_score},setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:tx.%{rule.id}-WEB_ATTACK/SQL_INJECTION-%{matched_var_name}=%{tx.0}"
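One way to explore the pattern is to try it against a few strings. The sketch below uses Python's `re` module rather than the engine ModSecurity uses, and it takes the pattern exactly as written between the quotes; both are assumptions, since the configuration parser may process backslashes before the regex engine sees them. In this reading, `\%` is simply an escaped literal `%`.

```python
import re

# The character class holds a backslash, "(", ")", "%" and "#";
# the alternative after "|" is a literal double dash.
pattern = re.compile(r"(?:[\\\(\)\%#]|--)")

samples = ["100%", "a\\b", "f(x)", "# note", "1 -- comment", "a-b", "plain"]
for text in samples:
    match = pattern.search(text)
    print(repr(text), "->", match.group(0) if match else None)
```

Every sample except `a-b` and `plain` produces a match, since a single dash is not in the character class.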
Created: (CXF-2908) Using a Java enum type in a JAX-RS matrix parameter results in a StackOverflowEr
Using a Java enum type in a JAX-RS matrix parameter results in a StackOverflowError when generating the WADL
Updated: (AVRO-510) Memory leaks in datafile reader & writer.
[
https://issues.apache.org/jira/browse...nels:all-tabpanel
]
Bruce Mitchener updated AVRO-510:
Issues with Node Regex
I am trying to match groups of nodes - i.e.
Node: synd1-path2.path2.some.domain
Node: synd2-path2.path2.some.domain
By using either of the node definitions below:
node /^synd\w+\.path2\.some\.domain$/ {
include ibapps
include db
}
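One thing worth checking with a pattern like this is that `\w` does not match a hyphen, so `synd\w+` cannot run past the `-` in the node names shown. The sketch below tests this with Python's `re` module as a stand-in for the Ruby regex engine Puppet uses (for these ASCII names `\w` behaves the same way); the `relaxed` pattern is a hypothetical alternative, not something from the original post.

```python
import re

nodes = ["synd1-path2.path2.some.domain", "synd2-path2.path2.some.domain"]

# Pattern from the node definition above.
original = re.compile(r"^synd\w+\.path2\.some\.domain$")

# Hypothetical alternative that accounts for the hyphenated first label.
relaxed = re.compile(r"^synd\d+-path2\.path2\.some\.domain$")

for name in nodes:
    print(name, bool(original.match(name)), bool(relaxed.match(name)))
# synd1-path2.path2.some.domain False True
# synd2-path2.path2.some.domain False True
```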
Created: (FELIX-2432) [gogo] NPE/coercion failure when null first parameter to method expecting arry
[gogo] NPE/coercion failure when null first parameter to method expecting arry
A question about android regex implementation
Hi Jesse and All, I have written some simple benchmarks for harmony regex and found that the performance of harmony is poor compared to the RI. For example, Matcher.find() only reaches 60% of that of the RI. I heard Android uses icu4jni to re-implement this module. Since icu4jni uses native code, I think it may have higher performance than harmony. I am trying to use icu4jni as the back-end of harmony regex, but I find icu4jni has no functions related to regex operations. I know there are some Android guys in our community. So can anyone give me some details about Android's regex, such as whether Android re-implements the regex logic in its own native code rather than using icu4jni, and whether it really gets higher performance compared to harmony regex? Thanks a lot!
Commented: (DERBY-4531) Client setCharacterStream closes its Reader argument stream in finalizer
[
https://issues.apache.org/jira/browse...4#action_12881194
]
Kristian Waagan commented on DERBY-4531:
use a statement in for_each() parameter