Feb 04, 2015

Geolocate website visitors ad-hoc

Want to find out from the webserver access logs alone whether you have any relevant website visitors? MaxMind's GeoIP Legacy databases can help with some IP introspection.

for ip in $(grep portfolio /var/log/apache2/access.log* | \
    grep -v -E '(84\.(184|185|186)\.)|(93\.225\.)' | \
    cut -d':' -f 2 | awk '{ print $1 }' | sort | uniq); \
do (echo "$ip"; geoiplookup -f ~/geoip/GeoLiteCity.dat "$ip"; \
    geoiplookup -f ~/geoip/GeoIPASNum.dat "$ip"; \
    grep -m1 "$ip" /var/log/apache2/access.log* | awk -F\" '{ print $6 }') | \
    paste -s | perl -pe 's/GeoIP.*?://g'; \
done

Paraphrasing: for every IP address that isn't mine, look up the location and print it together with the user agent in columns on stdout. There may be a more efficient structure using awk, but that is for another day. Shell globbing holds some pitfalls, though. Building the pipeline up and checking it step by step helps to catch errors; in the end, put it into a script with variables.

# sample output (altered ip)  DE, 16, Berlin, Berlin, 10625, 52.509201, 13.315800, 0, 0   AS6805 Telefonica Germany GmbH & Co.OHG    Mozilla/5.0 (Android; Mobile; rv:36.0) Gecko/36.0 Firefox/36.0
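
As a rough sketch of the "script with variables" idea, the IP-extraction half of the pipeline could live in a small function. The names `OWN_PREFIXES` and `visitor_ips` are made up for this example, and the prefixes are simply the ones from the one-liner. Feeding the logs in through `cat` keeps the file name out of each line, so the pattern can be anchored to the start of the address.

```shell
#!/bin/sh
# Hypothetical variable: address prefixes that belong to me and should be skipped.
OWN_PREFIXES='^(84\.(184|185|186)|93\.225)\.'

# Reads access-log lines on stdin, keeps those containing the keyword in $1,
# and prints every remaining visitor IP exactly once.
visitor_ips() {
    grep -- "$1" |
        awk '{ print $1 }' |
        grep -v -E "$OWN_PREFIXES" |
        sort -u
}
```

It would be called like `cat /var/log/apache2/access.log* | visitor_ips portfolio`, and the resulting list can feed the `for ip in ...` loop. If your rotated logs are compressed, the glob also matches the `.gz` files, which is one of the pitfalls to check first.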

Update: geoloc is an alternative to geoiplookup.

You could run the grep at regular intervals and have the results sent to you whenever anything changed, using cron, msmtp and mailx, or by writing to an RSS file. There will be regular crawlers and the usual vulnerability-scanner paths in the logs, so you would need to do a lot of filtering to receive meaningful reports from your custom pipeline.
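
A minimal sketch of the "only tell me when something changed" part, assuming the report arrives on stdin. The function name `report_if_changed` and the state file path are invented for this example; nothing here is specific to msmtp or mailx.

```shell
#!/bin/sh
# Hypothetical state file holding the previously seen report.
STATE="${STATE:-$HOME/.visitor-report.last}"

# Reads a fresh report on stdin. If it differs from the stored one, prints it,
# stores it and returns 0; otherwise prints nothing and returns 1.
report_if_changed() {
    new=$(mktemp) || return 1
    cat > "$new"
    if [ -f "$STATE" ] && cmp -s "$new" "$STATE"; then
        rm -f "$new"
        return 1
    fi
    cat "$new"
    mv "$new" "$STATE"
}
```

Since cron normally mails whatever a job prints to the crontab owner, staying silent when nothing changed may already be enough. Otherwise the output can be piped into `mailx -s "new visitors" you@example.org` (the address is a placeholder) whenever the function returns 0.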

I want to be ignored

To create the input for the inverted grep ("-v") I had to record, for a few days, the IP addresses my service provider chose to assign to my connection.

(curl -s whatismyip.akamai.com; echo; date "+%Y-%m-%d %a %H:%M") | paste -s >> ~/.myip.history

Wrap it up in a script and add it to your crontab

0 */8 * * * ~/bin/myip-history.sh

and after a few days

grep -o '^[0-9]\{1,3\}\.[0-9]\{1,3\}' ~/.myip.history | sort | uniq -c | sort -n | tac

it outputs something like

    25 84.184
    16 84.186
    11 93.225
     8 2.245

with the top entries representing the address space you're mostly in.
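
To get from this list to the pattern for the inverted grep, something along these lines might do. It is only a sketch: `own_prefix_regex` is a made-up name, and the number of prefixes to keep (three by default) is an arbitrary choice.

```shell
#!/bin/sh
# Reads history lines on stdin (IP address first on each line) and prints an
# extended regex matching the N most frequent two-octet prefixes (default 3).
own_prefix_regex() {
    grep -o '^[0-9]\{1,3\}\.[0-9]\{1,3\}' |
        sort | uniq -c | sort -rn | head -n "${1:-3}" |
        awk '{ print $2 }' |          # drop the counts
        sed 's/\./\\./g' |            # escape the dots
        paste -s -d '|' - |           # join the prefixes into one alternation
        sed 's/.*/^(&)\\./'           # anchor at line start, require a dot after
}
```

Used as `own_prefix_regex 3 < ~/.myip.history`, the result can be handed to `grep -v -E`. Because the pattern is anchored with `^`, it expects lines that start with the IP address, so run it on log lines without a file-name prefix.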