Although I did not take the time to decode the hex, the fact that we see your hostname in there as well as “http/1.1” makes me think this could be TLS (SNI and ALPN).
Sounds like an issue on their side, unless you have a particular configuration where port 443 can hit an HTTP parser without decrypting TLS first.
Config does not do anything exotic (like TLS in a tcp mode frontend, with some SNI dance and other complicated stuff).
Yes, the hex code is indeed from a TLS client_hello. This means that facebook is accessing port 80 via HTTPS.
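For anyone who wants to check a captured payload without decoding it by hand, here is a rough sketch of the heuristic. It assumes you have the first bytes of the rejected request as raw bytes; the function name and the sample bytes are made up for illustration. It only looks at the TLS record header (content type 0x16, major version 0x03) and the handshake type (0x01 for client_hello), so treat it as a hint rather than a full parser.

```python
def looks_like_tls_client_hello(data: bytes) -> bool:
    """Heuristic: do these bytes start like a TLS ClientHello record?"""
    # Record header is 5 bytes (type, version x2, length x2),
    # followed by the handshake message type.
    if len(data) < 6:
        return False
    content_type = data[0]    # 0x16 = handshake record
    version_major = data[1]   # 0x03 for SSL 3.0 and every TLS version
    handshake_type = data[5]  # 0x01 = client_hello
    return (
        content_type == 0x16
        and version_major == 0x03
        and handshake_type == 0x01
    )


# Illustrative inputs only, not taken from the logs in this thread.
tls_start = bytes.fromhex("1603010200010001fc0303")
plain_http = b"GET / HTTP/1.1\r\n"

print(looks_like_tls_client_hello(tls_start))
print(looks_like_tls_client_hello(plain_http))
```

If this returns True for bytes that arrived on port 80, the client was attempting a TLS handshake against a plaintext listener, which is consistent with the BADREQ entries.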
Could it be that someone just posted (or continues to post) a wrong URL on Facebook, where an HTTPS URL points to port 80:
https://fb.mysite.com:80/blabla
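If you have a list of shared URLs to go through, a small check along these lines could flag the kind of mismatch shown in the example URL. This is only a sketch: the helper name is hypothetical, and it only considers the default ports for http and https.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes we care about.
DEFAULT_PORTS = {"http": 80, "https": 443}


def scheme_port_mismatch(url: str) -> bool:
    """Return True if the URL's explicit port is the default port of the *other* scheme."""
    parts = urlsplit(url)
    if parts.port is None:
        # No explicit port, so the scheme's own default applies.
        return False
    other_defaults = [
        port for scheme, port in DEFAULT_PORTS.items() if scheme != parts.scheme
    ]
    return parts.port in other_defaults


for candidate in (
    "https://fb.mysite.com:80/blabla",
    "https://fb.mysite.com/blabla",
    "http://fb.mysite.com/blabla",
):
    print(candidate, scheme_port_mismatch(candidate))
```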
Do you also see valid requests from those Facebook crawlers, or only these bad requests?
I’d suggest you run a test, because after all, the users on Facebook are triggering those crawlers:
try posting a normal working HTTP link from your site, while checking the haproxy logs.
try posting a normal working HTTPS link from your site, while checking the haproxy logs.
try posting an HTTPS link that erroneously points to port 80 (like in the example above).
A private message should suffice to trigger this, you don’t have to post something publicly.
You could also capture port 80 traffic from those Facebook IP ranges, or all port 80 traffic (after all, you redirect everything to HTTPS, so it’s probably not a high volume).
Yes, most of the requests seem to be valid. I have seen users sharing content from the site just fine and I was personally able to share links as well. My Varnish cache is also showing a healthy number of cache hits.
Could it be that someone just posted (or continues to post) a wrong URL on Facebook, where an HTTPS URL points to port 80.
I doubt that, especially considering the number of BADREQ entries that I am getting. I only switched to HTTPS in June of this year, and the site had been operating for a year and a half before that, i.e. all of the original shares before June 2018 would have been to the HTTP site.
The fb.mysite.com subdomain is specifically for images that have been specified in the og:image tag, which is:
The URL of the image that appears when someone shares the content to Facebook.
I used the Facebook Sharing debugger to test a non-HTTPS link and it responded saying that it had followed the 301 redirect to the HTTPS site. The scraped image also appeared just fine.
Perhaps this is an issue with the crawler accessing previously shared non-HTTPS content?
I was returning to this topic and I noticed your statement above. Is there a way for me to FORCE these port 80 requests to port 443 for a particular host, i.e. fb.mysite.com?