Mutual auth & Certificate Revocation

Going beyond simple TLS involves mutual authentication and certificate revocation. Here are my notes on these topics.

Over the last few days, I have been working on mutual authentication / client certificates. While working on it, I learnt a few concepts and tools that could be useful to others.

What is Mutual Authentication?

According to Microsoft TechNet,

Mutual Authentication is a security feature in which a client process must prove its identity to a server, and the server must prove its identity to the client, before any application traffic is sent over the client-to-server connection.

Basically, it is a process where both the client and the server present a certificate, and both certificates must be verified for the TLS handshake to complete before any request/response traffic begins.

Why is it different?

In the normal TLS handshake (as with regular HTTPS websites), the burden of proof is on the server. That is, when I connect to my travel site, I need to be sure that I am actually talking to the travel website's server and not somebody else. I then gain access by providing my credentials. This is fine for most situations.

However, imagine I am an agent who can do bulk bookings. I may want lines of credit and confirmed bookings before my payment goes through. In such a scenario, the travel site would go through extra verification and ask for some deposit. Finally, they may give me a client certificate so that they know it is actually me doing the booking.

As another use case, when your phone tries to talk to the app store, it may need to prove that it is indeed an Apple / Android device before proceeding with the app updates. The device manufacturer can embed a client certificate when the phone is manufactured so that it can be trusted.

CodeProject has an excellent article on setting up and testing this kind of setup.
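
If you want to try this out locally, OpenSSL's built-in test server and client can exercise a mutually authenticated handshake. A minimal sketch, assuming you have already created a CA and issued server and client certificates (all file names below are placeholders):

# Test server that demands a client certificate signed by ca.crt
openssl s_server -accept 8443 -cert server.crt -key server.key -CAfile ca.crt -Verify 1

# Client that presents its own certificate during the handshake
openssl s_client -connect localhost:8443 -cert client.crt -key client.key -CAfile ca.crt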

Certificate Verification challenge

In a normal TLS handshake, the server presents its certificate and any intermediate certificates. Browsers ship with pre-loaded “Root” certificates. Using the root, the browser builds a trust chain and decides whether or not to trust the certificate.

A problem with this approach is that a server may be compromised. If its TLS certificate is then revoked by the CA, the browser will never know. To work around this, two techniques are in use. Both try to ensure that when a server sends a certificate, the browser can query the certificate authority (CA) to see if the certificate is still valid.

  • Certificate Revocation List (CRL): In this approach, the Certificate Authority (CA) publishes a list of the certificates it has issued along with their status. The list is published on a fixed schedule or right after a certificate is revoked. The primary challenge with this approach is that the CRL keeps growing and over time can get unwieldy.
  • Online Certificate Status Protocol (OCSP): In this case, the TLS certificate lists an OCSP domain. The client sends a request to this OCSP responder with the certificate it is trying to verify, and the responder replies whether the certificate is valid or revoked. The response is of a fixed size, so it does not get unwieldy.

Both CRLs and OCSP responses are digitally signed. OCSP also provides a way to send a nonce value to reduce the risk of replay attacks; unfortunately, not many responders support it, so it is not very effective.

Browser support for Certificate revocation

There is an excellent but dated article on browser support on SpiderLabs titled “Defective by Design? – Certificate Revocation Behavior in Modern Browsers”.

According to Wikipedia and a blog post by Maikel, here’s the status:

  • Safari supports OCSP checking
  • IE, Opera and Firefox support both CRL and OCSP. They do a soft-fail: if the CRL/OCSP server is not reachable, the certificate is accepted and the page loads normally.
  • Chrome does not perform OCSP/CRL checks directly, though they can be enabled if required. Chrome believes that none of the current methods are very effective and instead follows the strategy outlined on its CRLSets page.


Command Line check for OCSP

Finally, if you do need to run a check against an OCSP responder, OpenSSL has the commands to troubleshoot it. Ivan Ristic's excellent blog post, Checking OCSP revocation using OpenSSL, explains the process.
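
At a high level the check involves three steps; the host and file names below are placeholders, and the exact flags can vary a little between OpenSSL versions:

# 1. Capture the server's certificate chain
openssl s_client -connect example.com:443 -showcerts < /dev/null > chain.txt
# (copy the leaf certificate into cert.pem and its issuer into issuer.pem)

# 2. Read the OCSP responder URL embedded in the certificate
openssl x509 -in cert.pem -noout -ocsp_uri

# 3. Query the responder for the certificate's status
openssl ocsp -issuer issuer.pem -cert cert.pem -url http://ocsp.example-ca.com -text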

How to auto-upgrade to HTTPS (aka avoid mixed content)?

Full site HTTPS migration is hard. Consider using Content-Security-Policy header to make it easier.

tl;dr; Migrating to a full HTTPS site is hard. Using “Content-Security-Policy: upgrade-insecure-requests” can reduce the “mixed-content” errors for embedded objects. Finally, use the Strict-Transport-Security header to secure the domain and its sub-domains.

HTTPS Migration – The Challenge

In the recent past, there has been a lot of push to move websites to HTTPS. Google has been dangling the carrot of a better ranking by making HTTPS a ranking factor.

However, the biggest issue is timing the migration. If the primary site moves to HTTPS and the embedded objects do not, then the browser will block those resources. It is better to move the embedded objects over to a secure site and update the source code to change the references from HTTP to HTTPS.

Yet, changing source code is a long-drawn-out and difficult process. In such a scenario, Content-Security-Policy will be your friend.

Content-Security-Policy (CSP)

As per W3C, CSP is:

..a mechanism by which web developers can control the resources which a particular page can fetch or execute, as well as a number of security-relevant policy decisions.

One of its directives is upgrade-insecure-requests. When this directive is sent as a header or an HTML meta tag, the browser auto-upgrades requests to HTTPS.
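
For reference, the header and the equivalent meta tag look like this:

Content-Security-Policy: upgrade-insecure-requests

<meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">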

As per the specification, two kinds of links are upgraded:

  • Passive mixed content
    • Embedded links: These are the references to images, stylesheets and scripts.
    • Navigational links: These are the links placed in anchor (<a>) tags.
  • Active mixed content
    • These are the AJAX calls / XHR requests.

However, not all requests are upgraded. We learnt this the hard way during a migration.

Gotcha# 1: Browser support

First off, not all browsers support this directive. As per caniuse.com, Firefox, Chrome and Opera are the browsers that support it. IE, Edge and Safari currently do not.

[Image: caniuse.com browser support table for upgrade-insecure-requests]

Gotcha# 2: Exceptions

Although the W3C document mentions that navigational links are upgraded to HTTPS, Chrome and Firefox have different interpretations.

Here’s what Mozilla says about navigation links:

  • Links on same domain are upgraded
  • 3rd party links are not upgraded

Chrome on the other hand says this:

Note that having http:// in the href attribute of anchor tags (<a>) is often not a mixed content issue, with some notable exceptions discussed later.

So Chrome will not upgrade navigational (anchor) links to HTTPS.
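
As an illustration (both domains here are placeholders), Firefox would upgrade the first, same-site link but not the second, third-party link, while Chrome would leave both untouched:

<a href="http://www.example.com/about">About us</a>
<a href="http://partner.example.org/widget">Partner page</a>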

Gotcha# 3: Third Parties

Third party content is not upgraded. Since browsers don't know if those domains support HTTPS, they don't upgrade the links. In current versions, such content is silently blocked. You can find this blocked content by opening the developer tools in Firefox/Chrome and navigating to the console window. It would look like this example:

[Image: active mixed-content errors in the browser console]

What’s next?

By using the CSP header, most of the embedded object errors can be removed. CSP supports reporting as well. By enabling this, you as the content publisher can get the set of URLs being blocked/warned by browsers and fix them in the source code.
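
One way to gather these reports without blocking anything is a report-only policy that flags any non-HTTPS load; the reporting endpoint below is a placeholder, and a policy this broad will also report things like inline scripts, so treat it purely as a discovery aid:

Content-Security-Policy-Report-Only: default-src https:; report-uri https://www.mydomain.com/csp-reports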

A subsequent change would be to use the Strict-Transport-Security header. This header should be enabled after the migration is complete and baked in. When it is used, the browser ensures that all requests to the domain (and its sub-domains) are made over HTTPS. This eliminates the shortcomings of the plain upgrade directive.
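
A typical value looks like this; max-age is in seconds, and the one-year figure below is just a common choice:

Strict-Transport-Security: max-age=31536000; includeSubDomains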

How/Where to implement these changes?

As the upgrade directive and STS can be implemented with HTTP headers, you can introduce them at your web-server/proxy level or with your CDN. For more details on how a CDN can help in such a setup, refer to my blog on “How can CDN help in your SEO efforts?”.

How can CDN help in your SEO efforts?

A CDN can help with more than just improving site speed for SEO. Read about where CDNs are useful for your SEO efforts.

tl;dr;

CDNs can help in your SEO efforts beyond just speeding up the website. They can aid in better targeting, mobile friendliness, domain authority and more.

Background

This blog is a follow-up to my earlier post on What metrics matter for SEO?. In this post, I'd like to explore how a CDN can aid in different aspects of SEO.

But before we dig in, let me quickly reiterate the value proposition of using a CDN.

Why use a CDN at all?

In the seminal study titled It's the Latency, Stupid, Stuart Cheshire tried to understand which factor matters more for a “faster” website: is it the available bandwidth or the latency? His conclusion was that beyond a point, bandwidth has no impact on speed and it all boils down to latency.

So why is latency such a big speed killer? Simply put, the speed of data transfer over the internet is constrained by the distance between the user and the server. The theoretical best case is the speed of light; however, network components add processing time, and traversing the internet over large distances slows things down further.

CDNs deploy their servers such that end users talk to a server that is geographically and topologically closer to them. This way, the number of network hops and the network think time are reduced, which speeds up the website. The basic premise is that the user only needs to talk to the CDN server and not to the origin data center. So a user in Sydney, Australia only needs to talk to a CDN server in Sydney instead of going all the way to a data center in New Jersey, USA. Intuitively, this all makes sense.

CDNs also use better routing than the public Internet's default routing, which further improves latency.

With this quick primer, let's dig into CDNs and SEO!

Role of CDN in SEO

I’d like to discuss the following aspects of SEO. These are the areas where a CDN could help. I’ve pulled up these SEO factors from Moz’s report on Ranking factors.

  • Server response time
  • HTTPS / Secure sites
  • Domain authority
  • Mobile friendly website
  • URL Optimization
  • (Indirectly) Quality of other sites hosted on the same block of IP Addresses

Server response times

This metric is generally translated as TTFB in most studies that have looked at correlation data. However, a CDN can help not just in improving the TTFB but also in reducing the overall latency, thus improving other metrics like page load, start render or Speed Index.

Caching

The most economical and simple solution is to cache the page at the CDN. This should be sufficient as an immediate step towards better SEO. If our mate in Sydney only has to talk to the CDN server in her city, it is much faster than a request traveling to the server in New Jersey! Caching itself is more nuanced: an object could be cached at multiple places, as explained in the post titled A tale of 4 caches by Yoav Weiss. For our purpose, let's focus on caching at the CDN.
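
In practice, CDN caching is usually driven by caching headers from the origin. A typical example, where s-maxage applies to shared caches like a CDN and max-age to the browser (the values are illustrative):

Cache-Control: public, max-age=300, s-maxage=86400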

Streaming of response

In the HTTP world, a response can be generated in 2 ways:

  • Full response: The server waits until the entire response is created, potentially compresses it, and sends it down.
  • Chunking the response: The server starts to respond as soon as bytes are available.

Full response is useful for smaller objects or objects that don't need any server-side processing. These would be objects like images, stylesheets, static HTML pages and pre-generated PDFs. Such objects are very good candidates for caching as well.

Chunking is useful for dynamic pages like a personalized home page, reports, listings for hotel/flight reservations and so on. These pages could be cached, but may need to be qualified, e.g. cache only if the user is not logged in.
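
On the wire, a chunked HTTP/1.1 response looks roughly like the sketch below: each chunk is preceded by its size in hex and a zero-length chunk ends the response. The sizes and markup here are placeholders.

HTTP/1.1 200 OK
Content-Type: text/html
Transfer-Encoding: chunked

5a
<html><head>...</head>          (first chunk: static header markup, flushed immediately)
1c4
...personalized body content... (later chunk, sent once the back-end has generated it)
0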

Perceived performance optimizations

Once the basic caching optimizations are done, further tweaks could be made like lazy loading images and dynamically populating the page content. Google has confirmed that their bots are able to handle AJAX and so this is a safer tweak to make.

{{Does Bing support AJAX requests? I have not been able to find a document confirming this.}}

Content targeting

Since CDNs are aware of the user's geographic location, you could target the site better. The same mechanism can be used to segment the cache as well, so that latencies are reduced and server response time across different geographies is optimized.

HTTPS

Google has been pushing for secure websites. To encourage adoption, Google announced that it would not penalize sites for the HTTP-to-HTTPS redirect. Setting up HTTPS is easier with a CDN. It is cost effective as well, since certificate authorities like Let's Encrypt provide free SSL certificates.

However, HTTP to HTTPS migration is a lot more than just changing the protocol, as experienced by Wired. So plan it out thoroughly, even when using a CDN.

Mobile Friendly Websites

Building mobile friendly / responsive websites is hard. The primary challenge is identifying whether a device is a mobile, tablet or desktop. This could be avoided by building a purely responsive website where everybody gets the exact same set of resources and the browser displays them according to the device capabilities. However, this causes the most resource bloat for the smallest devices. For more on issues around responsive design, read Guy Podjarny's blog post Responsive Web Design Makes It Hard To Be Fast.

A CDN could help in the responsive design / mobile friendly websites in multiple ways:

  • Device detection: A CDN could tell you if a device is a mobile / tablet / desktop. For example, Akamai's solution provides details on different aspects of the device: http://edc.edgesuite.net/. Using this, a server could vary the response and reduce the resource bloat.
  • Mobile specific connection optimization: A CDN could detect the network capabilities and apply optimizations like image conversion, degrading image quality, and tuning connection timeouts.
  • CDN logic: On the CDN, logic can be implemented to serve different resources based on device type and reduce the bandwidth used. This helps speed up the site and improves the user experience.
  • Site migration: When moving from an m.dot website to a new responsive website, effort is required to handle a soft launch and to gradually migrate the bots. All of these routing decisions can be made at the CDN layer.

URL and Domain tweaks

CDNs are like proxies. They can take a request on an incoming domain+URL and send it to a totally different domain and URL. So SEO-friendly domains could be set up, especially for domain shards. Similarly, user-friendly URLs could be used for publishing through the CMS platforms. At the CDN, these URLs could be translated to the format expected by the back-end systems. This helps your end users as well, and ultimately results in better value from Google indexing and potentially a higher ranking on search results.

For example, you could have a publishing URL like “www.mydomain.com/coats/black-winter-coat”. At the CDN, this could be translated to the CMS URL as “www.mydomain.com/p/a/ac?pid=123”.

Such optimizations also play well with the maximum URL length restrictions recommended by a CDN.

Domain Authority

First let's understand the meaning of domain authority. Moz defines it as follows:

Domain Authority is a score (on a 100-point scale) developed by Moz that predicts how well a website will rank on search engines.
Moz: Domain Authority

What’s the issue

Basically, if you are amazon.com, you just get ranked higher than mom&pop.com. That's because Google has seen amazon.com delivering results that are relevant and popular. It has a brand that is trusted. Hence Google rewards it by giving it a higher domain authority.

Another factor that goes into this authority is the use of a relevant name. For example, a website about “Web Analytics” named webanalyticsexplained.com will rank better than if it were named johndoesblog.com. Similarly, domain names that have hyphens and numbers are considered to have a lower trust rating and are marked down.

However, domains are hard to set up, and managing them involves effort and time. A case in point would be an organization hosting a big event. Suppose http://www.mycompany.com is hosting a super famous SEO conference called “Best SEO Meet Ever (BSME)”. The IT team would find it easier to simply re-use the existing data center and existing firewall rules. In such a case, hosting the conference site on a new domain “www.bsme.com” may get complicated.

CDN To the rescue

With a CDN, domains can be spun up and brought down while the origin data center details remain unchanged. So the CDN could be told to send requests for http://www.bsme.com to the parent site on a special path like http://www.mycompany.com/bsme/. Once the event is over, the CDN could even set up a 301 redirect to the parent site so that the audience earned is not lost.

Bottom line: it is easier to target a keyword with a dedicated domain and set it up on a CDN than to attempt the entire setup at the origin data center.

Other CDN optimizations

Handling Failures

When a site suffers an outage and bots receive errors while trying to index it, they get confused. With a CDN, it is possible to set up fail-over mechanisms. Origin failures that are temporary could be coded to respond with a 500. If maintenance is planned, the CDN can be coded to respond with a 503 and a “Retry-After” header. In all cases, the CDN could respond with a simple HTML message that explains the issue, so that real users are not left in confusion.
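
For the planned-maintenance case, the CDN's response could look like this (the retry window and message are examples):

HTTP/1.1 503 Service Unavailable
Retry-After: 3600
Content-Type: text/html

<html><body>We are down for scheduled maintenance. Please check back in an hour.</body></html>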

This ensures bots get the right message and don’t index the wrong page or ignore a site for longer than necessary.

Stale pages / Redirects

When websites change, they leave behind a legacy of 404s. These are bad for users and a missed opportunity for SEO. Using a CDN, these 404s can be corrected to respond with a redirect to the updated content. This is both good user behavior and a way of retaining the link juice.
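
Reusing the coat-page URL from the earlier example, the corrected response for a stale path might look like this:

HTTP/1.1 301 Moved Permanently
Location: https://www.mydomain.com/coats/black-winter-coat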

Do note that redirects have specific meaning with respect to SEO. Have a look at this blog post for more details.

A/B, Multivariate testing

CDNs, being proxies, can act as a control point for your multivariate testing. At the CDN you could define logic like sending 10% of mobile traffic to a new site design and then tracking them with analytics or RUM to measure the success criteria.

Conclusion

In this post, I’ve tried to cover the reason why a CDN can help in SEO efforts. Apart from improving the latency and bumping up the site speed, CDNs could help in addressing issues of domain authority, managing vanity URLs and targeting efforts.

If you’d like to know more, DM me @rakshay.

What metrics matter for SEO?

tl;dr;
Google is very nuanced in the way it handles site speed. It appears to rely on some combination of TTFB coupled with a rendering metric like time to first paint / start render or DOMInteractive. However, it is very hard to pin down the exact metric. So focus on delivering the best performance to the user, and Google will automatically rank you well!

Background

Before giving away the answer on which metric matters, let’s first study the published studies around this topic.

Recently, a study was published with the title Does Site Speed Really Matter to SEO?. Its main conclusion was that TTFB is the metric that correlates most strongly with a higher Google ranking. So this was generally considered to be the smoking gun and the metric that needs to be optimized if you want a better ranking on Google.

A study pretty similar to this one was published a few years back by Zoompf. The study was summarized on the Moz site under the title How Website Speed Actually Impacts Search Ranking. Again, the basic premise was that TTFB had a much higher correlation with Google ranking. Specifically, a site with a lower TTFB would be ranked higher on Google SERP (Search Engine Results Page). Metrics like Time to DocComplete and RenderTime were considered to have low or no correlation to the actual ranking by Google.

The third and more interesting study was published by an SEO analyst under the title Does Speed Impact Rankings?. In this study the author concludes that:

  • Although speed is important, it is just one of the many ranking factors
  • User interface plays a big role. Creating a very minimalistic interface may not help in getting a better Google ranking
  • If you’re starting on optimizing, start with TTFB

I like this study since the reason TTFB is considered important is that it is easier to measure and independent of the browser. All other metrics like Start Render, DomInteractive, etc rely on specific browser implementations. So optimizing the metric for say Chrome may or may not impact the actual SERP. However, optimizing TTFB would impact each and every user and bot.

Now, let’s dive a bit more into the nitty-gritties of what matters for SEO.

Sampling Bias?

Although I highly value the studies done by each of the above authors, there are some unexplained issues or problems with the methodology. The first study clearly hints at this under the section Tail wagging the dog. Let me quote the statement:

Do these websites rank highly because they have better back-end infrastructure than other sites? Or do they need better back-end infrastructure to handle the load of ALREADY being ranked higher?

In other words, all of the studies have an issue where it's hard to ignore this:

[Image: correlation vs. causation comic]
(Digressing: if you want to read up on more serious issues related to the “correlation is not causation” problem, head over here: http://www.skepticalraptor.com/skepticalraptorblog.php/correlation-does-not-imply-causation-except-when-it-does/).

To summarize, here are the issues with the studies:

  • They only look at one factor like TTFB: A page from a big company could have a higher TTFB but also be a much more complex page that Google favors over a similarly fast but less complex page.
  • They ignore the actual end-user behavior: Suppose there are 2 websites discussing the strategy used by 2 teams in a football match. Site A has detailed paragraphs while Site B has a short summary followed by graphs, pictures and images. Site A could start out ranked higher due to TTFB/start render, but over time Google will learn from user behavior and rank Site B higher.
  • They fail to combine the metrics for a holistic view: TTFB may impact site speed. However, what if TTFB combined with start render and the number of images on the page is the real reason for a higher ranking? The last factor is not even part of the analysis.
  • Sampling bias in research terms: If the researchers had used terms like “tesla”, there is a very high chance that brands associated with the name will rank higher regardless of site performance, simply due to relevance. It is unclear how well the terms were selected and whether they were devoid of any such terms. Even if the search terms are filtered, Google has started to answer questions directly on the search results page, so ranking may have no impact. For example, if you search for “sampling bias”, there is a snippet explaining it and I never have to click on the results at all.

Due to these complexities, I wanted to explore a more robust and continual study and found the research by Moz pretty interesting.

Study by Moz

Moz publishes a study called Search Engine Ranking Factors. They use a combination of correlation data and survey data to get the results. This is quite a unique way of presenting the information.

First, let’s look at the correlation results.

Moz: Correlation data

Since I am at Akamai, my focus is on the metrics associated with the Page-Level Keyword-Agnostic Features and Domain-Level Keyword-Agnostic Features. Of these, a few things stand out:

  • Server response time for this URL
  • URL is HTTPS
  • URL length
  • URL has hyphens
  • Domain has numbers
  • Length of domain name and length including sub-domains

Apart from this, the bulk of the metrics are related to the actual page content, links to the page, social interactions and mentions.

Now, let’s look at the survey data.

Moz: Survey data

The survey results have many metrics in common with the correlation data and a few that are different. Here are the metrics interesting to me:

  • Page is mobile friendly
  • Page load speed
  • Page supports HTTPS
  • Page is mobile friendly (for desktop results)
  • Search keyword match to domain
  • Use of responsive design and/or mobile-optimized
  • Quality of other sites hosted on the same block of IP Addresses

Here are a few factors that were considered to negatively impact the ranking:

  • Non mobile friendly
  • Slow page speed
  • Non mobile friendly (for desktop results)

In short, mobile friendly / responsive sites, coupled with site speed, are important for both mobile and desktop ranking. And this makes sense, since we're talking about a well designed site that works on both desktop and mobile and loads fast. And that's before even considering all the content-related optimizations!

Conclusion

After looking at the various research and surveys, it is clear that ultimately Google and other search engines want to provide results that are relevant and popular. For this, they may be using a lot of ranking factors, and these may keep changing over time. Trying to address a single ranking factor could be a very hard game. Instead, webmasters should work with content creators to provide relevant content that users actually desire. It should be presented in a way that is pleasing and actionable. To aid in this, webmasters could:

  • Build mobile friendly websites
  • Make it fast – across all the metrics
  • Keep it secure and host on a relevant domain
  • If needed, associate with a brand so that people and bots trust the page

All this boils down to Matt Cutts' simple statement:

You never want to do something completely different for googlebot than you’d do for regular users.

SEOLium blog post

WebPerformance notes from PerfPlanet

Notes from the best articles on #webperf from the PerfPlanet calendar posts.

Every year, in the month of December, calendar.perfplanet.com invites experts from the web performance community to contribute their ideas as one blog post a day. It has some very insightful articles and hints at the upcoming technologies.

I went through the articles, made some notes, and thought of sharing them for myself and for anyone who is pressed for time.

Day 1: Testing with Realistic Networking Conditions

The routes and peering relationships for all of the CDNs and servers involved in delivering any given content mean that it usually works best if you test from a location close to the physical location your users will be coming from, and then use traffic-shaping to model the link conditions that you want to test.

In the real world, if you over-shard your content and deliver it over lots of parallel connections on a slow underlying network, you can easily get into a situation where the server thinks data queued in buffers has been lost and re-transmits it, causing duplicate data to consume the little available bandwidth.
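
On a Linux test machine, one way to approximate such link conditions is the kernel's netem traffic shaper; the interface name and numbers below are placeholders:

# Add 100ms of latency and 0.5% packet loss to eth0
tc qdisc add dev eth0 root netem delay 100ms loss 0.5%

# Remove the shaping when the test is done
tc qdisc del dev eth0 root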

Day 2: Lighthouse performance tool

Although a lot of the tooling is for PWAs, it has a command line option and the report is quite useful. Need to try it out.

Day 3: Brotli compression
– Brotli over HTTPS only
– FF, Chrome and Opera only

Day 4: HTTP/2 Push – Everything about Push!
Rules of thumb for H2 push: https://docs.google.com/document/d/1K0NykTXBbbbTlv60t5MyJvXjqKGsCVNYHyLEXIxYMv0/edit

Four use cases for pushing resources:
– after html
– before html (most interesting, works with CDNs)
– with resources
– during interactive / after onload

Resource hints are cross origin while H2 push is not.

Lots of challenges with push, esp Push+Resource hints. Performance varies between cold and warm connection and degrades with higher latency connections like 2G. Lot more research required.

Colin Bendell's https://canipush.com is a good resource to test a browser's capability to accept server-side push.

Day 5: Meet the web worldwide

Get data on different aspects of web usage from: https://www.webworldwide.io/
Desktop

Day 6: Measuring WebPageTest Precision

For a desktop experience, the default 9 runs yielded following precision:
TTFB: around 6% to 8%
Other metrics: better than 6%
If 3% precision or better is sought, then 20 or more runs are recommended.
If a 10% precision only is sought, then 7 runs should be enough.
Mobile

For a mobile experience, the default 9 runs yielded a 3% or better precision.
If a 5% precision only is sought, then 5 or more runs should be enough.
If a 10% precision only is sought, then a single run should do.
Day 7: Progressive Storyboards

It's important to ensure that the sequence in which features are revealed to the user is natural and follows the user's incremental needs as they wait comfortably for all the information and interactive functionality to be shown.

1. Verify Destination: The first step of any web navigation from one page to another, or of an interaction within a website interface, is to assure the user that the action they performed was the one they intended, so they can comfortably wait while it loads without wondering if they need to click the Back button.
2. Provide primary content
3. Allow interaction
4. Show secondary content
5. Below the fold

Day 12: Prefer DEFER over ASYNC

Async will not block HTML parsing but it will block rendering. So prefer DEFER over ASYNC when possible.
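
In markup the difference is just an attribute on the script tag; both variants download in parallel with HTML parsing, but the deferred script waits until parsing is finished before executing (the file names are placeholders):

<script src="analytics.js" async></script>
<script src="app.js" defer></script>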

Day 17: Rise of Web Workers

Web workers may be used to handle all the non-DOM manipulation tasks, especially when working in a framework-based environment like React or Angular. Needs more research though.

Day 20: Font-face syntax optimizations

Remember that browsers use the first format they find that works—so if you don't order them correctly the browser could waste resources (or render poorly) with a less-than optimal format.

@font-face {
  font-family: 'Open Sans';
  /* List the most efficient format first; the browser uses the first source it supports */
  src: url(opensans.woff2) format('woff2'),
       url(opensans.woff) format('woff');
}

First use woff2 and then woff. Doing this eliminates older browsers, but they'll still see the content in system-defined fonts.
Day 24: A tale of 4 caches

There are different levels of caching in the browser, and the behavior may vary based on the way an object was loaded. For example, a preloaded object may not persist across navigation, unlike a prefetched object. The location of the cache also has implications on whether an object shows up in the developer tools.

Day 25: Root Domain issues with CDN

The ANAME/ALIAS is resolved by your DNS provider's nameserver instead of by a recursive resolver (ISP, Google Public DNS, OpenDNS or other) and this may lead to end users being routed to a far-away CDN node and consequently getting a poor experience.

Day 26: PNG Image optimizations

PNG optimization using pngquant and zopflipng.
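
Typical invocations look like this; the quality range and extra-iterations flag are just common choices, so check each tool's documentation:

# Lossy palette reduction
pngquant --quality=65-80 --output image-small.png image.png

# Lossless recompression; -m spends more time for a smaller file
zopflipng -m image-small.png image-final.png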

Day 27: HTTP/2 Push and Progressive JPEGs

Progressive images with HTTP/2 should help in rendering images faster. To optimize further, consider changing the scan settings at the image-optimization workflow stage to improve the Speed Index.

Day 28: Links to interesting posts

There are a lot of articles mentioned in here. I am going to focus on these 2 for now:

Useful SEO Tags – Part 2

Meta tags are the backbone of SEO. I explore the usage of these tags in the Alexa top 1000 websites in this post.

In the first part of the blog, I looked at the usage pattern for one particular tag. In this blog post, I'd like to focus on the meta tags. If you'd like to see all the possible options, please refer to the meta-tags website.

Meta Tags that matter

Moz has an excellent article by Kate Morris on the set of meta tags that matter. In it, she lists just 2 tags as essential. These are:

  1. meta-description
  2. meta-content-type

There are a bunch of others that are listed as optional and should be used only if the default behavior has been changed.

With this knowledge in hand, I wanted to check if the Alexa Top 1000 websites are sticking to the best practice or if there is a heavy use of the tags with no real returns.

Meta-Tag distribution

Let's start by looking at the top 25 meta tag attributes being used. I have published the entire distribution here: https://gist.github.com/akshayranganath/ad0b170550714e2a77612bf0f81057da.

Attribute Count
content 14574
name 6205
itemprop 4316
property 3301
http-equiv 1219
charset 439
itemscope 214
itemtype 214
itemid 212
data-reactid 161
id 94
class 47
data-app 42
value 37
data-type 36
lang 23
data-react-helmet 17
data-ephemeral 17
data-dynamic 13
xmlns:og 12
scheme 11
data-page-subject 7
xmlns:fb 7
data-ue-u 7
prefix 3

[Image: distribution of meta tag attributes]

This is a simplified check, since some uses of the meta tag depend on a combination of attribute values. For example, the meta description would look like this:

<meta name="description" content="This page is on SEO stuff.">

My first version of the script does not capture this dependency, but I hope to add that capability over the next few weeks. At a very high level, it looks like most sites do follow the best practice. Schema.org tags are being heavily used, and this makes sense due to their growing importance and the ability to control the behavior of results in SERPs.
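
For a single page, a rough command-line approximation of this attribute count looks like the sketch below. It only handles double-quoted attributes on meta tags that sit on one line, so treat it as a sanity check rather than a replacement for the real crawl:

curl -s https://www.example.com/ \
  | grep -o '<meta [^>]*>' \
  | grep -oE ' [a-zA-Z:-]+="' \
  | tr -d ' ="' \
  | sort | uniq -c | sort -rn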

Strange Use-Cases

I did see meta tags being put to some strange uses. For example, some attributes that are not even required are quite heavily used. Here's a use that seems to make absolutely no sense:

<meta content="id" name="language" />

Conclusion

Meta tags appear to have been used by the top 1000 sites in the intended manner, for the most part. I plan to revisit this, explore the tag usage in more depth, and break down the usage patterns. Stay tuned.