S3E8 Robin Allenson, Deduping Large Websites

Keira Davidson (00:24):

Hello, and welcome to the TechSEO podcast, which is hosted by myself, Keira Davidson, a senior SEO executive at SALT. This episode is with Robin Allenson, who is the CEO of Similar.ai. How are you today?

Robin Allenson (00:43):

Keira, thanks for having me. I’m feeling great today. How are you?

Keira Davidson (00:47):

I’m good, thank you. It’s quite nice. It’s quite sunny out today. We’ve had a month’s worth of rain. I don’t know about you.

Robin Allenson (00:53):

Yeah. Well, I’m in Holland. It’s sunny today. It was sunny yesterday. I’m just bracing myself for the next 11 and a half months before we get summer again.

Keira Davidson (01:03):

Oh, wow.

Robin Allenson (01:05):

No, I’m joking.

Keira Davidson (01:07):

I had a little bit of looking into you and noticed that you’ve been in the industry for a little bit of time, and you’ve worked at some really interesting places, such as Yext, and you are now a CEO. How did you initially get into the industry?

Robin Allenson (01:29):

Way back when, a long, long time ago, I used to help a company called Red Fish make websites. And gradually, after that, we did a lot of different online work, but this is, oh, 20 years ago. And then, around 2009-ish, I started working at Yellow Pages and we built search engines for Yellow Pages Online to work on, and, as part of that work, I started helping Yellow Pages out with SEO. Actually, so Yellow Pages, I guess, was more 2005 or so. After that, I left Yellow Pages and started a new startup called InnerBalloons, which is a terrible name for a startup, and I won’t recommend that to anybody, but it was an anagram of my name.

Robin Allenson (02:14):

And then, InnerBalloons, we did a bunch of different things, mostly focused on SEO for verticals, and also conversion rate optimization for verticals, and then we gradually pivoted towards selling presence management and reputation management, and we used a lot of AI to build out a presence management product, so local SEO. And then we got acquired by Yext at the end of 2014, and so then I came on board to help them kickstart their operations in Europe. And then I left Yext a year later, because I think they wanted me to manage the existing business and I was interested in building out something completely new, and then, shortly after that, started Similar.ai. And so we’ve done a bunch of different things as well at Similar.ai, but we’re really focused right now on how we scale SEO through APIs for marketplaces.

Keira Davidson (03:22):

Oh, that’s really exciting. I think marketplaces, from my point of view anyway, are often left behind. There’s so many opportunities and so much to learn from them from a technical perspective, but they’re usually just forgotten about.

Robin Allenson (03:40):

Mostly, what you see, there are in-house SEOs, product managers are working with a team of developers at a marketplace. Often, those developers are tasked with doing lots of different things for the marketplace, it’s not just SEO, but, yes, there’s a lot of opportunity. We talk about an inside out versus an outside in way of thinking about a site. Inside out is you take all the product taxonomy you have and, often, these sites, they have an existing site taxonomy, which is quite old, and it doesn’t always match up to how users search, but they just take all the different combinations of that site taxonomy, and so you might see a lot of things that look like faceted navigation, all the possible attributes, and you might also see pages that they could create that would match up to how users are searching now, but aren’t in the site taxonomy, which was created a decade ago.

Robin Allenson (04:46):

And so there’s a lot of missing opportunities, both too much, too many pages, and I think a lot of sites successfully took a what I call a spray-and-pray approach. You take all the possible combinations of product attributes and you turn those into pages and you get these incredibly long-tail pages. And some of those ranked and some of those got traffic, right, and a lot didn’t. And, after a while, after a few years, Google starts to get fed up with investing time and resource in crawling all those pages because many of those pages don’t match up to demand and many of those pages don’t have unique content, so they’re not sought after and they’re not interesting.

Robin Allenson (05:28):

Those are the problems that you often see, but they’re both problems with too many pages and they’re problems with lacking pages that match up to demand, so they’re missing pages. Site depth is a big problem. You have incredibly deep internal linking that’s not actually linking to pages that users would love, and a lot of those are just enormous long lists of listings. That can look a bit spammy to Google. Adding a small amount of content on those [inaudible 00:06:00] pages to orient the user could be enormously powerful. Those three problems are the things that Similar.ai is uniquely focused on.

Keira Davidson (06:09):

Yeah. And they all relate to duplicate content, duplicate pages, poor value. They don’t really provide much of a benefit to organic performance. If anything, they’re hindering it because they dilute link equity, they waste crawl budget, they’re just crawl bloat. There are so many issues that can be caused by what you just mentioned. And we briefly touched upon, before we started the podcast, how there are different approaches that you can take when it comes to deduping large sites. And, for example, if a site only has, let’s say, 100 pages, it’s quite easy to just do it manually. It might be that the product has five different alternatives and you only have 25 products. It’s not going to be horrendous to do manually. Whereas, when looking at a site that’s got, let’s say, hundreds of thousands of pages, maybe millions, it’s just not possible. No human is capable of going through each … and analyzing the data and determining what’s duplicate, what’s not, within a reasonable timeframe.

Robin Allenson (07:35):

That’s it.

Keira Davidson (07:37):

How would you approach this?

Robin Allenson (07:41):

That’s a great question. What we’re doing is really building a software-as-a-service product to be able to handle this at scale. And we split up the cleanup of superfluous pages, pages that aren’t adding a great deal of value, into two types, and they definitely overlap as well. This is the cleanup part, but we also do … we add internal linking and we add automatically-generated dynamic content that’s based on how users search. We call that user-centric content. Those are other pieces I won’t be going into in as much depth, but adding content that matches up to how users search, that matches up to search demand, is very powerful, also, for making the difference between pages more explicit. And so our goal with the whole platform is to make every single site and every single page on the site match up to a sought-after user intent for which the site has answering content.

Robin Allenson (08:54):

And so, for most small sites and large sites … and so you were mentioning marketplaces with maybe more than a million pages, one of our larger customers has 600 million pages. Another customer has more than 50 million products. These are just gargantuan sites, and so we split the cleanup work into two parts. One is what we call classic duplication, so very similar to keyword cannibalization, except rather than at a keyword level, we’re looking at all the pages which answer the same user need. And so you might have 100 keywords or even 1,000 keywords which all basically answer the same user need, and sometimes people call that search intent, other people would call that a keyword cluster or a topic, and we look at all the pages that target the same user need. We work out which is the best page on the site, and then we offer a file which lists out all the weaker pages and the single canonical stronger page.
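The grouping Robin describes here can be illustrated roughly as follows. This is only a minimal sketch, not Similar.ai's implementation: the input rows, the `strength` score used to pick the best page, and the `build_dedupe_map` name are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input: (url, user_need, strength) rows. "strength" stands in for
# whatever signal is used to pick the best page; the conversation does not say.
pages = [
    ("/cars/volkswagen-polo", "volkswagen polo", 0.9),
    ("/search?q=volkswagon+polo", "volkswagen polo", 0.1),
    ("/tags/vw-polo", "volkswagen polo", 0.3),
    ("/cars/ford-fiesta", "ford fiesta", 0.8),
]


def build_dedupe_map(rows):
    """Return {weaker_url: strongest_url} for every user need with more than one page."""
    by_need = defaultdict(list)
    for url, need, strength in rows:
        by_need[need].append((strength, url))

    mapping = {}
    for candidates in by_need.values():
        if len(candidates) < 2:
            continue  # a single page for a need is not a duplicate
        best_url = max(candidates)[1]  # highest strength wins
        for _, url in candidates:
            if url != best_url:
                mapping[url] = best_url
    return mapping


dedupe_map = build_dedupe_map(pages)
```

The resulting mapping is the kind of "weaker page, stronger page" file that could be handed to the site owner, who then decides what to do with each weaker page.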

Robin Allenson (09:58):

And then the site owner, so, typically, the growth product manager or the in-house SEO, decides: are we going to 301-redirect those, or are we going to add a canonical tag? Mostly, for marketplaces, a canonical tag is probably too weak a hint, but that can work, especially if, on the site, users will actually see some of those pages and they might navigate to those on the site themselves. If you redirect, you’re going to have to remove the internal linking to those pages. You’re going to have to make sure people don’t accidentally go there except through Google. But if you are unable to do that for some reason, then maybe a canonical tag will be fine. Other growth managers would actually prefer to add 410s, so a discontinued product way of thinking about it, just because there’s no link equity there.
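The three options mentioned here differ in what the server sends back for a weak page. A minimal sketch, assuming a hypothetical `respond` helper and `Action` enum (neither comes from the conversation): a 301 sends a `Location` header, a 410 simply reports the page as gone, and a canonical tag leaves the page served with a hint in its HTML head.

```python
from enum import Enum


class Action(Enum):
    REDIRECT_301 = "301"
    CANONICAL = "canonical"
    GONE_410 = "410"


def respond(action, target=None):
    """Return (status, headers, head_snippet) for a weak page. Illustrative only."""
    if action is Action.REDIRECT_301:
        # Permanent redirect: crawlers and users are sent to the stronger page.
        return 301, {"Location": target}, ""
    if action is Action.GONE_410:
        # Gone: the page is reported as permanently removed, with no target.
        return 410, {}, ""
    # Canonical: the page is still served, with a hint pointing at the stronger page.
    return 200, {}, f'<link rel="canonical" href="{target}">'
```

Which action applies to which page is the strategy decision that stays with the in-house SEO or growth product manager.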

Robin Allenson (10:55):

The pages that we’re deduping, they don’t have traffic, yeah, they don’t have demand typically. We’re looking at the vast majority of pages that really don’t have any chance of ranking and we’re trying to clean those up first. That’s the first target, so pages which share the same user need. And then the second group is pages which don’t target enough demand, and so there’s an overlap there. Sometimes those pages really aren’t interesting either, but what we’re doing is we’re matching those pages up to topics or to user needs and work out the total demand for the user need. And, if that’s less than some threshold level, say, 100 searches a month in the U.S. last month, then we … oh, I’m being spammed.

Keira Davidson (11:54):

Don’t worry.

Robin Allenson (12:00):

If that’s less than some minimum threshold, then we flag that page as being what we call a page to hide, and so it’s just something that you wouldn’t like to surface to search engine users or to search engines. And, again, there are some different approaches to hiding that. You might just orphan those, you might just remove that from the site map, because that’s typically search results pages that haven’t had traffic, that aren’t organically ranking, but also, more importantly, they have no chance of ever getting traffic. Even if they rank really well, they’re only going to bring in a dribble over time. Some really big sites get a ton of their traffic from really long-tail pages, so you might want to set that threshold much lower, but we look at both of those as a cleanup, so both deduping in the classic sense and also hiding pages.
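The "page to hide" rule can be sketched as a simple filter. This is an assumption-laden illustration: the `page_to_need` and `need_demand` inputs and the `pages_to_hide` name are hypothetical, and the default of 100 monthly searches is only the example figure mentioned in the conversation, which a site with a lot of long-tail traffic might set much lower.

```python
def pages_to_hide(page_to_need, need_demand, min_monthly_searches=100):
    """Return URLs whose user need has total demand below the threshold.

    page_to_need: {url: user_need}            (hypothetical input)
    need_demand:  {user_need: monthly searches} (hypothetical input)
    """
    return sorted(
        url
        for url, need in page_to_need.items()
        if need_demand.get(need, 0) < min_monthly_searches
    )


page_to_need = {
    "/cars/volkswagen-polo": "volkswagen polo",
    "/cars/volkswagen-polo/beige/1994": "beige 1994 volkswagen polo",
}
need_demand = {"volkswagen polo": 1650, "beige 1994 volkswagen polo": 10}

hidden = pages_to_hide(page_to_need, need_demand)
```

Lowering `min_monthly_searches` keeps more long-tail pages visible; raising it hides more of them.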

Robin Allenson (12:57):

And so, under the hood, we’re building for a certain [inaudible 00:13:04] to say this is a fashion marketplace, we’re in the U.S., we’re building a transactional universe of all the possible keywords with which people could search for fashion in the U.S., and we’re pulling in all the keywords that the site ranks for as seen through the Google Search Console API. And so, pulling in through the API, you get a much greater granularity than you can see in the Search Console user interface, so we merge those together. You’re typically talking about millions of keywords. And then we cluster those, we group those by user need. And so, for a single topic with the same user need, you could have five or 10 or 100 or 1,000 individual keywords.

Robin Allenson (13:50):

We’re adding up all the search volumes. We’re getting rid of some of the search volumes that are incorrect. Misspelled keywords often have the same search volume as the correctly spelled ones, so we detect that and we remove those. Sometimes there are other aberrations in the search volumes you get. We work out what the actual real topic demand is and then we use that as an underlying platform both for the deduping based on user need and also working out which pages are pages to hide. Sorry, that was a long explanation, but that’s how the cleanup part of the platform works.
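One way to picture the volume cleanup is below. The detection rule is an assumption, not something stated in the conversation: a keyword is treated as a duplicated misspelling when it is a near-identical string to an already counted keyword and reports exactly the same volume. The `topic_demand` name and the 0.85 similarity cutoff are likewise hypothetical.

```python
from difflib import SequenceMatcher


def topic_demand(keywords, similarity=0.85):
    """Sum search volumes for one keyword cluster.

    keywords: {keyword: reported monthly volume}
    Near-identical spellings that report the exact same volume are counted once
    (an assumed heuristic for spotting duplicated misspelling volumes).
    """
    counted = []  # (keyword, volume) pairs already added to the total
    total = 0
    # Highest volume first, then alphabetical, so the order is deterministic.
    for kw, vol in sorted(keywords.items(), key=lambda kv: (-kv[1], kv[0])):
        is_duplicate = any(
            vol == seen_vol
            and SequenceMatcher(None, kw, seen_kw).ratio() >= similarity
            for seen_kw, seen_vol in counted
        )
        if not is_duplicate:
            total += vol
            counted.append((kw, vol))
    return total


cluster = {
    "volkswagen polo": 1000,
    "volkswagon polo": 1000,  # misspelling reporting the same volume
    "vw polo": 400,
    "used volkswagen polo": 250,
}
demand = topic_demand(cluster)
```

Here the misspelled variant is skipped, so the cluster's demand is the sum of the three remaining volumes rather than all four.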

Keira Davidson (14:29):

One thing that stuck out to me during that was using a 410 to highlight to search engines that it no longer exists, and I hadn’t thought about that. I just thought of canonicalizing, redirecting. That’s really interesting. I’ve actually made a note of that.

Robin Allenson (14:51):

Yeah. I think there’s four, and so the 410 was new to me too. I think it’s one of the joys about working and focusing on marketplace SEO is that we get to work with some of the most brilliant in-house SEOs, and they’ve been thinking long and hard about how to do this, right? And so, often, what we see is lots of these marketplaces have in-house efforts to do deduping, but they find the engineering effort is considerable, so they often do that one-off, or they’ve done it with an agency and they’ve done it for some parts of the site. And then they have a lot of thought about how to approach this and sometimes depending on how bloated the index is and how many of those pages they really have to get rid of.

Robin Allenson (15:48):

Some of the sites we’re working with are working through hundreds of millions of old pages. And so you can clean up a lot and still not see a great deal of benefit because they have a lot of backlog, right? They’ve been doing this for a decade, right, and there’s a lot of stuff there. Some of them just say, “Look, these pages, they have zero link equity, we have actually no interest in them, so all we’re going to do is we’re going to add a 410 and we’re going to add that to the site map, and when Google crawls that, then they’re going to remove that, and that’s going to be effective.” Others say, “Look, there’s still something there. The long-tail pages are super important to us. So we’re going to redirect to a relevant page.”

Robin Allenson (16:24):

And so 301, I think, is the default approach we see in, I guess, 70% of customers. 410 has come up recently, so we have that connotation. Others say, “We’re just going to orphan pages. We’re just going to remove internal linking and remove it from the site map. We don’t really think most of these pages are in the index anyway. We’re just going to ignore that.” You see some different things, but it’s cool because you see a range of different … I was exactly what you said, so I thought, “You’re going to 301, right, or add a canonical,” but, if you have a few hundred million pages, I think canonical, it just seems like too weak a hint. That was all I thought of, but I continue to learn from our customers.

Keira Davidson (17:18):

Yeah. My mind’s now just still thinking about 410s, about how they’ll naturally just fall out of the index, and how actually, initially, when I was thinking of it, it was like, oh, that is a really good … imagine a Band-Aid, you initially put it on, and then, if you later wanted to revisit it, you could. You could put in a different approach, but there actually won’t be a need to do another approach-

Keira Davidson (17:41):

… because it actually will solve it further down by search engines no longer needing to crawl the page. They’ll naturally just remove it from the search. I actually quite like that approach now.

Robin Allenson (17:58):

When we talk to clients who don’t have a great deal of knowledge of SEO, then they’re often very interested in adding things. They want to add more linking, they want to add more new pages, and often there are great opportunities to drive traffic that way. But the more experienced the SEO, and a lot of the in-house SEOs you work with at marketplaces are extremely experienced, yeah, they say, “We can’t get to all that interesting stuff of adding pages and adding content until we’ve cleaned up our debt, right?” Our programmers talk about technical debt. We’ve got this enormous page debt that we need to clean up first and there’s just a lot of bloat, right? We’ve got to work on subtracting pages first and then we can focus Google’s attention around … we can put more wood behind less arrows, right? And so that’s the number one and two and three goal. Yeah, then these kind of approaches can work well, but it was completely new to me, as well, until a few months ago.

Keira Davidson (19:02):

And you briefly mentioned, obviously, if you orphan the page, you’ll want to remove the internal links. If you’re, let’s say, 301-redirecting instead of using that 410, you’re probably going to want to point the internal links straight to the end page instead of creating redirect chains. Do you have a way of almost automating that instead of having to go in manually one by one? Can you create a rule or something?

Robin Allenson (19:31):

Yeah. That’s a great question. We crawl all the pages. We build out this intent universe or topic universe or keyword universe that’s got all of the intents. It’s interesting, actually, for a big domain, you might have millions of keywords, but, typically, you’d have tens of thousands of user needs, so that actually comes down to a surprisingly reasonable, manageable number, except that many of those user needs have tens of different keywords that can express that same need. And then we match those up to the pages and we work out, for pages which share the same user need, what the strongest page is. And then, typically, we’re recommending 301-ing to those strongest pages.

Robin Allenson (20:24):

But sometimes we’ll start by only looking at pages which don’t rank or only looking at pages which don’t get traffic, and then thinking more carefully about if you have … We have some examples where there’s 10 pages, which all target the same user need, and so, less for marketplace … in marketplaces, you’d see more things like … it depends. Some marketplaces are really focused on … they have a really strong internal taxonomy, and they used that without thinking are there users who are looking for this combination of attributes? Other marketplaces have used on-site search as a proxy for Google search, and so they create pages internally at different places within the site taxonomy and with different tags, which correspond to what users have typed in.

Robin Allenson (21:17):

But users, just like you see if you look at Google Search Console data, users misspell things all the time, right, and they’re searching in different places on the site, so we might have 10 or 20 pages all targeting the same … I’m thinking about an automotive example. There’s a Volkswagen Polo, and then there’s, I don’t know, five different misspellings of Volkswagen because, mostly, car brands and makes and models, they’re foreign words for everybody, and people misspell foreign words, so people misspell those. Polo, you’d think it’d be simple, but you’d be amazed, right? There’s lots of misspellings. And then, also, there is a place on the site for the Volkswagen Polo, but there are lots of other pages that are created elsewhere on the site.
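To illustrate how pages created from misspelled on-site searches might be tied back to one user need, here is a rough sketch using fuzzy string matching from Python's standard library. It is an assumption for illustration only: the `KNOWN_MODELS` list, the `canonical_model` name, and the 0.8 cutoff are hypothetical, and the conversation does not say how the matching is actually done.

```python
from difflib import get_close_matches

# Hypothetical catalogue of correctly spelled make/model names.
KNOWN_MODELS = ["volkswagen polo", "ford fiesta"]


def canonical_model(query, known=KNOWN_MODELS, cutoff=0.8):
    """Map an on-site search query to the closest known model, or None."""
    matches = get_close_matches(query.lower().strip(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this, queries such as "Volkswagon Polo" and "wolkswagen polo" both resolve to the same catalogue entry, so the pages created for them can be grouped under one user need, while an unrelated query resolves to nothing.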

Robin Allenson (22:05):

We crawl all the pages, all the [inaudible 00:22:08] pages. We’re pulling them all in and we’re matching up to what topics they match up to or what user needs they match up to and then we’re identifying which user needs are common, and then we’re trying to work out what is the right page to redirect to. We also have pages which don’t have any demand, where, again, if the client would like to 301, then we need to find a relevant page that has demand, that has listings or has content, to which we can redirect those pages, and sometimes there’s going to be overlap in those. And, when there’s overlap, we need to be very careful that we don’t create a chain or we don’t create loops, as an example. We’ve built software that does that.
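The chain and loop check can be sketched as follows. This is a minimal illustration rather than the software Robin refers to: the `flatten_redirects` name and the dictionary input format are assumptions. Each weak page is resolved to its final destination, so no redirect points at a page that is itself redirected, and a loop raises an error instead of being exported.

```python
def flatten_redirects(redirects):
    """Resolve every source URL to its final target.

    redirects: {source_url: target_url}
    Raises ValueError if following the redirects ever revisits a URL (a loop).
    """
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        while target in redirects:  # the target is itself redirected: a chain
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat
```

For example, `{"/a": "/b", "/b": "/c"}` flattens so that both `/a` and `/b` point straight at `/c`, which is also where internal links should point.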

Robin Allenson (22:52):

Before we create the export … really, the way it typically works, the marketplace builds a very simple service on their side to do a 301-redirect based on a bulk file of weak pages going to stronger pages, and then we give them all the information of, “Here’s the list of weak pages and the stronger page to which you should redirect them or to which you should add a canonical tag,” or whatever that might be. The actual implementation on their side is very simple. And then, on our side, we’re doing the heavy lifting of working out which pages are tied to the same user need and how we check for overlaps.
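The "very simple service" on the marketplace side might look something like this sketch. The CSV layout with `weak_url` and `strong_url` columns, and the `load_redirects` and `handle_request` names, are assumptions for illustration; the conversation only says the file lists weak pages and the stronger page for each.

```python
import csv
import io


def load_redirects(fileobj):
    """Read a bulk file with hypothetical columns weak_url,strong_url into a dict."""
    reader = csv.DictReader(fileobj)
    return {row["weak_url"]: row["strong_url"] for row in reader}


def handle_request(path, redirects):
    """Return (status, headers): a 301 for a listed weak page, otherwise 200."""
    target = redirects.get(path)
    if target is not None:
        return 301, {"Location": target}
    return 200, {}


# Usage with an in-memory file standing in for the delivered export.
sample = io.StringIO("weak_url,strong_url\n/tags/vw-polo,/cars/volkswagen-polo\n")
table = load_redirects(sample)
```

Keeping the lookup this simple is what lets the heavy lifting, deciding which pages belong in the file, stay on the other side.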

Robin Allenson (23:31):

And then we make sure, for the whole of the site … and so, for a small implementation, we might be talking about millions or tens of millions of pages, right, so it’s pretty high scale, but then, yeah, we’re checking that there aren’t chains or loops in there before we hand that file over. We’re typically doing that in bulk so that, every day or every week or every month, we could hand over a new file and we could make updates to those things or remove redirects that were there.
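When a new bulk file replaces an old one, the updates and removals can be worked out by comparing the two mappings. A minimal sketch, assuming both files have already been loaded as `{weak_url: strong_url}` dictionaries; the `diff_redirects` name is hypothetical.

```python
def diff_redirects(previous, current):
    """Compare two {weak_url: strong_url} mappings.

    Returns (added, removed, changed):
      added   - redirects present only in the new file
      removed - weak URLs whose redirect should be dropped
      changed - redirects whose target differs from the previous file
    """
    added = {s: t for s, t in current.items() if s not in previous}
    removed = sorted(s for s in previous if s not in current)
    changed = {
        s: t for s, t in current.items() if s in previous and previous[s] != t
    }
    return added, removed, changed
```

The same comparison works whether the file is regenerated daily, weekly, or monthly.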

Robin Allenson (24:03):

That’s part of the advantage of doing that in an automated way. The disadvantage is it’s insanely complex, but the advantage is, well, it’s going to be pretty complex anyway, but now we can actually enforce that with software. Because we think doing deduping at this level, it’s not the kind of optimization problem for humans to execute, but that’s a machine problem, that’s machine learning, that’s a software problem, but you still need people on the other side to work out, well, should we be doing a 301, should we do that for pages with traffic? There are lots of permutations to think through the strategy, and I think that’s the kind of thing that’s not easily codifiable, that you need amazing SEOs for, but going through page by page and working out does this match up to the same intent, yeah, I think that’s software and AI work that the SEO should hand off.

Keira Davidson (25:01):

Yeah. That’s so interesting. My mind is just going 100 miles per hour just thinking about all the different possibilities and different approaches. It’s really interesting. I really appreciate you joining me today. I’m definitely going to be looking into this a little bit after we finish up here today.

Robin Allenson (25:22):

Happy to give you a demo one-on-one, and, if there are listeners who’d like to see how this works in practice, always happy to show them around what we’re up to.

Keira Davidson (25:31):

That’s perfect. Thanks so much. Really appreciate you taking the time out of your day to join me and thank you so much.

Robin Allenson (25:40):

Yeah. Keira, it’s been wonderful. Thanks so much for inviting me.

Keira Davidson (25:42):

No worries.
