Feb 04, 2015

Geolocate website visitors ad-hoc

Want to find out from the webserver access logs alone whether you have any relevant website visitors? MaxMind's GeoIP Legacy databases can help with some IP introspection.

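#!/bin/bash
# For every visitor IP that requested "portfolio" (own address ranges excluded),
# print the IP, its GeoIP city and ASN, and its first logged user agent on one line.
# grep -h drops the filename prefix, so the client IP is the first field.
for ip in $(grep -h portfolio /var/log/apache2/access.log* | \
    grep -v -E '^(84\.(184|185|186)|93\.225)\.' | \
    awk '{ print $1 }' | sort -u); \
do (echo "$ip"; geoiplookup -f ~/geoip/GeoLiteCity.dat "$ip"; \
    geoiplookup -f ~/geoip/GeoIPASNum.dat "$ip"; \
    grep -h -m1 -F "$ip" /var/log/apache2/access.log* | awk -F\" '{ print $6 }') | \
    paste -s | perl -pe 's/GeoIP.*?://g'; \
done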

Paraphrasing: for every IP address that's not mine, look up the location and print it together with the user agent in columns to stdout. There might be a more efficient structure using awk, but that's for another day. Shell globbing holds some pitfalls, though. Building up the pipeline and going through it step by step helps catch errors; put it into a script with variables only at the end.

# sample output (altered ip)
77.11.44.22  DE, 16, Berlin, Berlin, 10625, 52.509201, 13.315800, 0, 0   AS6805 Telefonica Germany GmbH & Co.OHG    Mozilla/5.0 (Android; Mobile; rv:36.0) Gecko/36.0 Firefox/36.0
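
As a rough idea of what the awk variant mentioned above could look like, here is a minimal sketch. It assumes Apache's "combined" log format arriving on stdin without filename prefixes (for example from `grep -h`); the function name `unique_visitors` is made up for illustration. It replaces the `sort | uniq` step and the second grep per IP, not the geoiplookup calls.

```shell
#!/bin/bash
# Sketch: one awk pass that prints each client IP once, together with the
# first user agent logged for it. Assumes the "combined" log format on stdin.
unique_visitors() {
    awk -F'"' '
        {
            split($1, head, " ")   # head[1] is the client IP
            ip = head[1]
            if (!(ip in seen)) {   # only the first hit per IP is printed
                seen[ip] = 1
                print ip "\t" $6   # split on quotes, field 6 is the user agent
            }
        }'
}

# Usage:
#   grep -h portfolio /var/log/apache2/access.log* | unique_visitors
```

The output keeps the order in which the IPs first appear in the logs, so no sorting is needed.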

Update: geoloc is an alternative to geoiplookup.

You could run the grep at regular intervals and have the results sent to you whenever something changes, using cron, msmtp and mailx, or by writing to an RSS file. Regular crawlers and the usual vulnerability-scanner paths will show up, so you would need a lot of filtering to get meaningful reports out of your custom pipeline.
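
The "only tell me about changes" part could be sketched like this. The function name `new_lines`, the state file and the report script in the usage comment are all hypothetical, and the mail step assumes mailx is already set up to send through msmtp.

```shell
#!/bin/bash
# Sketch: read a report on stdin and print only the lines that were not
# part of the previous run. The state file (first argument) remembers the
# last report between runs.
new_lines() {
    local state="$1" current
    current=$(mktemp)
    sort -u > "$current"           # comm needs sorted input
    touch "$state"                 # first run: start with an empty state
    comm -13 "$state" "$current"   # lines that appear only in the new report
    mv "$current" "$state"
}

# Usage from cron (hypothetical script name and mail address):
#   new=$(~/bin/visitors.sh | new_lines ~/.visitors.state)
#   [ -n "$new" ] && printf '%s\n' "$new" | mailx -s "new visitors" you@example.com
```

Because only unseen lines are printed, a cron job built on it stays silent until a new visitor line shows up.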

I want to be ignored

To create the input for the inverted grep ("-v"), I had to record the IP addresses my service provider chose to assign to my connection over a few days.

(curl -s whatismyip.akamai.com; echo; date "+%Y-%m-%d %a %H:%M";) | paste -s >> .myip.history

Wrap it up in a script and add it to your crontab

0 */8 * * * ~/bin/myip-history.sh

and after a few days

grep -o '^[0-9]\{1,3\}\.[0-9]\{1,3\}' ~/.myip.history | sort | uniq -c | sort -n | tac

it outputs something like

    25 84.184
    16 84.186
    11 93.225
     8 2.245

with the top entries resembling the address space you're mostly in.
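
From there, the pattern for the inverted grep can be assembled by hand, or generated with something like the following sketch. The function name `exclude_regex` and the default threshold of 10 sightings are assumptions for illustration. The resulting pattern is anchored with `^`, so it only works when the IP is the first thing on each log line (for example when grep runs with `-h`).

```shell
#!/bin/bash
# Sketch: read the IP history on stdin and build an anchored extended regex
# from every two-octet prefix seen at least $1 times (default: 10).
exclude_regex() {
    local min="${1:-10}"
    grep -o '^[0-9]\{1,3\}\.[0-9]\{1,3\}' |
        sort | uniq -c |
        awk -v min="$min" '$1 >= min { print $2 }' |
        sed 's/\./\\./g' |          # escape the dots for use in a regex
        paste -s -d'|' - |          # join the prefixes with "|"
        sed 's/.*/^(&)\\./'         # wrap as ^(a|b|c)\.
}

# Usage:
#   grep -h portfolio /var/log/apache2/access.log* | \
#       grep -v -E "$(exclude_regex < ~/.myip.history)"
```

If no prefix reaches the threshold the pattern is useless, so check its output once before wiring it into the pipeline.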