Anant Shrivastava
17 July 2011
This paper discusses the relatively nascent field of web application fingerprinting: how automated web application fingerprinting is performed today, what the visible shortcomings of the approach are, and ways and means to avoid web application fingerprinting.
Fingerprinting, in its simplest sense, is a method used to identify objects. The same term has been used for identifying TCP/IP stack implementations, known as TCP/IP fingerprinting, and the usage has lately been extended to identifying web applications installed on an HTTP server.
"If you know your enemies and know yourself, you can win a hundred battles without a single loss." (The Art of War, Chapter 3)
In the same spirit, web application fingerprinting is performed to identify the application and software stack running on the HTTP server. Web application fingerprinting is still at a nascent stage; however, awareness of it is increasing and a large number of automated solutions are emerging in the market.
Web application fingerprinting is a quintessential part of the information gathering phase [4] of (ethical) hacking. It allows the tester to narrow down on specifics instead of looking for every clue. An accurately identified application also helps in quickly pinpointing known vulnerabilities before moving on to the remaining aspects. This step is also essential for allowing a pen tester to customize payloads or exploitation techniques based on the identification, increasing the chances of a successful intrusion.
Historically, identification of open source applications has been easier, as their behavior patterns and source code are publicly available. In the early days, web application identification was as simple as looking in the footer of the page for text like Powered by <XYZ>. However, as more and more server admins became aware of this, pen testers' approaches to identifying the web application running on a remote machine became more complex.
This is the simplest method. The manual approach is to open the site in a browser and look at its source code. In the automated approach, a tool connects to the site, downloads the page, and runs some basic regular expression patterns against it, each giving a yes or no result. Basically, what we are looking for is a unique pattern specific to a piece of web software.
Examples of such patterns are:
Meta tag
Folder names in the link section
Evergreen notice at the bottom
URL pattern: http://<site_name>/OWA/
URL pattern: http://<site_name>/component/
URL pattern: /_layouts/*
These regular expressions can be combined into a monolithic tool that identifies everything in one go, or organized as a pluggable architecture with one pattern file per application type.
Examples of tools using this technique include browser plugins such as Wappalyzer and Web Technology Finder, and similar tools.
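The pattern-matching idea can be sketched in a few lines of Python. This is a minimal illustration, not the code of any tool named above: the SIGNATURES table, the identify function, and the mapping of each URL pattern to an application name are assumptions made for the example.

```python
import re

# Hypothetical signature table: application name -> regular expressions.
# The name-to-pattern mapping is an assumption made for this sketch.
SIGNATURES = {
    "WordPress": [
        re.compile(r'<meta\s+name="generator"\s+content="WordPress[^"]*"', re.I),
        re.compile(r"/wp-content/", re.I),
    ],
    "Outlook Web Access": [
        re.compile(r"/owa/", re.I),
    ],
    "SharePoint": [
        re.compile(r"/_layouts/", re.I),
    ],
}


def identify(html):
    """Return {application: number of matching patterns} for one page."""
    hits = {}
    for app, patterns in SIGNATURES.items():
        count = sum(1 for pattern in patterns if pattern.search(html))
        if count:
            hits[app] = count
    return hits


if __name__ == "__main__":
    page = ('<meta name="generator" content="WordPress 3.1.3" />'
            '<link href="/wp-content/themes/x/style.css">')
    print(identify(page))
```

Keeping one entry per application in the table mirrors the pluggable layout described above: adding a new application means adding a new entry, not changing the matching loop.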
This approach doesn’t download the page; instead it looks for obvious traces of an application by requesting known URLs directly, building lists of found and not-found applications in the process. In the early days of the internet this was easy: fetch the headers, check whether the response is 200 OK or 404 Not Found, and you are done.
However, in the current scenario, many sites serve custom 404 pages and actually return 200 OK when a page is not found. This complicates the effort, and hence the new approach is as follows.
Based on this assumption and knowledge, tools of this kind look for known files and folders on a website and try to determine the exact application name and version.
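One common way to cope with custom 404 pages that return 200 OK is to first request a path that should not exist and use the answer as a baseline. The sketch below assumes that idea; probe, http_fetch, and the fetch parameter are hypothetical names for this example and are not taken from any specific tool.

```python
import uuid
import urllib.error
import urllib.request


def http_fetch(url):
    """Return (status_code, body) for a URL; HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def probe(base_url, paths, fetch=http_fetch):
    """Return the subset of `paths` that appear to really exist on the site."""
    # Ask for a random path that should not exist, to learn how this server
    # answers for missing pages (a real 404, or a "soft 404" with 200 OK).
    missing_status, missing_body = fetch(base_url + "/" + uuid.uuid4().hex)
    found = []
    for path in paths:
        status, body = fetch(base_url + path)
        if status != 200:
            continue
        # If missing pages also return 200, accept only different content.
        if missing_status == 200 and body == missing_body:
            continue
        found.append(path)
    return found
```

The exact byte comparison is deliberately crude: a custom error page that echoes the requested URL would differ on every request, so a real scanner would need a fuzzier comparison.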
An example of such a scenario would be
This is a relatively new approach, considered by far the most accurate for identifying an application and its specific version.
The technique basically works on the pattern below.
One of the best implementations of this technique is BlindElephant.
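The checksum idea can be sketched as follows. This is an illustration under stated assumptions, not BlindElephant's actual code: build_table and guess_versions are hypothetical helpers, MD5 is chosen only for the example, and the release data is assumed to be available locally.

```python
import hashlib


def build_table(releases):
    """releases: {version: {path: file_bytes}} -> {path: {md5_hex: set(versions)}}."""
    table = {}
    for version, files in releases.items():
        for path, data in files.items():
            digest = hashlib.md5(data).hexdigest()
            table.setdefault(path, {}).setdefault(digest, set()).add(version)
    return table


def guess_versions(table, observed):
    """observed: {path: bytes fetched from the target}. Intersect candidate versions."""
    candidates = None
    for path, data in observed.items():
        digest = hashlib.md5(data).hexdigest()
        versions = table.get(path, {}).get(digest)
        if versions is None:
            continue  # unknown or modified file: no evidence either way
        candidates = set(versions) if candidates is None else candidates & versions
    return candidates or set()
```

Each fetched file narrows the set of candidate versions; a file whose digest is not in the table contributes nothing, which is exactly the blind spot discussed later in the paper.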
As you might have guessed, these automation tools have certain disadvantages too.
Programming Language: Ruby
This is one of the best applications in this space, with a pluggable architecture that allows detection of virtually any application. As you can see in the script-let below, this software performs the following tasks.
This effectively allows it to report the application more accurately. Being pluggable in nature, it can also be customized for any application encountered.
Programming Language: JavaScript
Wappalyzer is a Firefox and Chrome plugin. It works only on regular expression matching and needs nothing other than the page being loaded in the browser. It works completely at the browser level and gives its results in the form of icons.
Programming Language: Python
This is a new entrant in the market and works on the principle of static file checksum based version differences.
As described by the author on its home page: The Static File Fingerprinting Approach in One Picture
This allows the software to work for both open source and closed source software; the condition is that the person running BlindElephant needs access to the source code in order to build fingerprints for all the static files.
The basic logic is here:
Programming Language: Python
Plecost works on the simple principle of finding the right files.
It derives the version of WordPress from readme.html as shown below:
This section works for all open source WordPress plugins that are available from the wordpress.org site.
Basically, it tries to fetch the readme.txt for each plugin and, based on that, deduces the version of the plugin installed on the server.
Note: wordpress.org mandates that every plugin author provide a correctly formed readme.txt file, and hence the chances of finding these files are very high.
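As a rough illustration of why a well-formed readme.txt is so useful to a scanner, the sketch below pulls the "Stable tag" header out of a plugin readme. It assumes the wordpress.org readme format and is not Plecost's actual code; plugin_version and the sample plugin text are hypothetical.

```python
import re

# "Stable tag: x.y.z" is one of the header lines of a wordpress.org readme.txt.
STABLE_TAG = re.compile(r"^\s*Stable tag:\s*(\S+)", re.I | re.M)


def plugin_version(readme_text):
    """Return the Stable tag value from a plugin readme, or None if absent."""
    match = STABLE_TAG.search(readme_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "=== Example Plugin ===\n"
        "Contributors: someone\n"
        "Stable tag: 1.2.3\n"
    )
    print(plugin_version(sample))
```

Note that the value is whatever the author wrote, so it can be a label such as trunk rather than a version number.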
Programming Language: Python
W3AF aims to be the Metasploit of the web, and hence is attracting quite a lot of attention nowadays. Listed below is among the first plugins for web application fingerprinting in W3AF.
This plugin again takes a retro approach: it looks for exact file names and paths, then looks for specific data inside the file and, if that data exists, deduces that the application is WordPress. It is highlighted here to stress the point about paths and the flaws that we will discuss after this section.
If you have looked carefully at the details above, you will find a lot of alarming things.
They assume that
These tools provide results and in turn generate confidence; however, everyone needs to understand that an automated solution can also be fooled. As such, the output of these tools should be treated as non-authoritative.
If you carefully observe the discussion of each scanning solution above, you will see
Let’s analyze the statement
Does that ring a bell? A known location is not necessarily always the same. For example, in WordPress we can already change the location of the wp-content folder, and in future we will be able to change other folder locations too (namely wp-admin and wp-includes).
Effectively, this tells me that the application has hashes pre-computed and as such does not consider that the user may manipulate the files.
This is one of the oldest methods of controlling what you don’t want others to see.
A few things to keep in mind.
One application-specific fingerprinting tool is Plecost, for WordPress. As seen above, it works in a straightforward way: request a file, and if the response is 200 OK with valid content, the plugin exists; otherwise it doesn’t. This kind of approach can easily be thwarted using simple .htaccess based rules or redirection policies.
For example, to beat Plecost all you need to do is block access to readme.txt (.html) from outside.
I suggest blocking rather than removal because removed files will be restored during the next upgrade, whereas blocking will help in all cases.
#blocking access to readme.txt
<files readme.txt>
order allow,deny
deny from all
</files>
This should be enough to block access to readme.txt; similar steps can be performed to block the majority of other files and folders.
What we are trying to do here is disallow everyone direct access to important files.
This section is hypothetical as of now, describing a theoretical approach to beating the checksum based software versioning scheme. People are welcome to enhance this approach and integrate it into the defensive tool chain.
As already described above, the checksum based approach has inherent flaws in its design and hence can be forced to fall into its own trap. I have divided this part into three basic sections: text files, binary files, and incremental chaos.
The general approach for all of the below is as listed
This section deals with various techniques you can use to thwart checksums.
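For text assets such as CSS or JavaScript, any byte-level change yields a different digest, even a trailing comment that does not alter behavior. The sketch below illustrates that one idea only; perturb_text_asset and its marker parameter are hypothetical, and this is not presented as the full set of techniques.

```python
import hashlib


def md5_hex(data):
    """Hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def perturb_text_asset(data, marker=b"site-build"):
    """Append a harmless trailing comment to CSS or JavaScript content.

    The /* ... */ comment syntax is valid in both languages, so the file
    behaves the same while its checksum no longer matches the stock release.
    """
    return data + b"\n/* " + marker + b" */\n"


if __name__ == "__main__":
    original = b"body { margin: 0; }\n"
    changed = perturb_text_asset(original)
    print(md5_hex(original))
    print(md5_hex(changed))
```

Because blocking would be undone by an upgrade only for removed files, a change like this would likewise need to be reapplied whenever an upgrade restores the stock files.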
Manipulating image files is a much more daunting task, as it involves binary files.
Things we need to keep in mind
However, we can still look at the following prospects.
This is my favorite part. As we are already trying to confuse the automated solutions, why not add more spice to the recipe? So here is my next hypothetical solution, which is targeted at making automated solutions even less useful. While we are hiding our actual platform, why not give them something they can reliably detect and report? In simple terms
We hide our original software using the approaches described above, so the scanner might return no result. Relying on the confidence these automated solutions present, why not make sure there is an output? A wrong result reported with full confidence could lead to lots of chaos.
So this is how we will proceed with this approach
Combined result: INCREMENTAL CHAOS.
As we can see in the steps above, steps 2 and 3 are the most critical ones. We need to keep the following points in mind while doing such a task:
In this section I suggest some approaches that could be followed to work towards better scanners.
Right now tools rely heavily on regular expressions, and as such can easily be defeated by placing fake text in comments or similar places. Parsing engines should be more intelligent and should look at the actual content rather than the whole text.
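A small step in that direction is to parse the markup instead of running patterns over the raw text, so that anything inside an HTML comment is ignored. The sketch below uses Python's standard html.parser module; GeneratorFinder and find_generators are hypothetical names for this example.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect <meta name="generator"> values from real markup only."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "generator" and attr.get("content"):
            self.generators.append(attr["content"])

    # handle_comment keeps its default no-op behavior, so tags written
    # inside <!-- ... --> never reach handle_starttag.


def find_generators(html):
    """Return the generator strings found outside of HTML comments."""
    parser = GeneratorFinder()
    parser.feed(html)
    parser.close()
    return parser.generators
```

This only removes the cheapest trick: a fake tag placed in real markup would still be picked up, which is why the cross-checks suggested in this section matter as well.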
Another approach is to replace checksum based comparison with an actual diff, using weighted scores to validate the differences. This would be resource intensive but could yield good results; the remaining problem would be version identification.
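A minimal sketch of the diff-based idea, assuming reference copies of each version's file are available locally: it scores similarity with Python's difflib rather than demanding an exact checksum match. closest_version is a hypothetical helper, and the similarity ratio stands in for the weighted score mentioned above.

```python
import difflib


def closest_version(observed, known):
    """known: {version: reference text}. Return [(version, similarity)], best first."""
    scores = []
    for version, reference in known.items():
        ratio = difflib.SequenceMatcher(None, reference, observed).ratio()
        scores.append((version, ratio))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores


if __name__ == "__main__":
    known = {
        "1.0": "body{margin:0}\n",
        "2.0": "html,body{margin:0;padding:0}\nh1{font-size:2em}\n",
    }
    # A file that was lightly modified by the site owner.
    print(closest_version("body{margin:0}\n/* custom */\n", known))
```

A modified file no longer produces "no match"; it produces a ranked list, at the cost of comparing against every reference copy. When two versions ship nearly identical files their scores will be close, which is the version identification problem noted above.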
We should always cross-reference other inputs. For example, suppose an Apache server is detected with the .aspx extension, and all the automated checks report that a “SharePoint server” is present. This doesn’t seem right to the human mind. Similar reasoning should be embedded in the logic of the automated tool.
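Such a sanity check could be as simple as a compatibility table consulted before a result is reported. The sketch below is an assumption-laden illustration: EXPECTED_SERVERS, server_family, and cross_check are hypothetical, and the table contents are examples rather than an authoritative list.

```python
# Hypothetical compatibility table: application -> server families it
# would normally be expected to run on. Contents are illustrative only.
EXPECTED_SERVERS = {
    "SharePoint": {"iis"},
    "WordPress": {"apache", "nginx", "iis"},
}


def server_family(server_header):
    """Reduce a Server header such as 'Apache/2.2.17 (Unix)' to a family name."""
    value = server_header.lower()
    if "microsoft-iis" in value:
        return "iis"
    for family in ("apache", "nginx"):
        if family in value:
            return family
    return "unknown"


def cross_check(app, server_header):
    """Return 'consistent', 'suspicious', or 'unverified' for a detection."""
    family = server_family(server_header)
    expected = EXPECTED_SERVERS.get(app)
    if expected is None or family == "unknown":
        return "unverified"
    return "consistent" if family in expected else "suspicious"
```

A "suspicious" verdict does not prove the detection wrong, since headers can be faked too; it only tells the tool to lower its confidence instead of reporting the result as certain.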
Each framework leaves subtle hints in its implementation of various protocols. Similar to HTTP fingerprinting [2], that approach could also be used to detect the presence of a specific application.
Examples of such implementations could be RSS/ATOM [6] and XMLRPC [5].