Link Building

The ILO framework is a large-scale automated internal link optimisation pipeline developed by Dejan Marketing. The system generates link recommendations by finding the most similar pages on large websites. If your website has 100+ pages, this solution is ideal for you. Our current pipeline natively handles websites with up to 1,000,000 pages. For websites with more than 1,000,000 pages, a custom data ingestion pipeline may be more suitable.

  • Customisable crawling and data extraction
  • Domain-specific pre-processing
  • Language-agnostic BERT sentence embeddings
  • Efficient similarity search
  • Designed for large-scale websites
  • Multilingual

We were given our very own bespoke internal link recommendation engine that leverages world-class language models and data science. It's one thing to theorize about the potential of machine learning in SEO, but it's entirely another to witness it first-hand. It changed my perspective on what’s possible in enterprise SEO.

Workflow

System Design

In consultation with your team, we design a highly customised strategy for your internal link optimisation project. We take numerous factors into consideration, including your goals and objectives as well as the technical aspects of your setup.

  • How many domains do you have?
  • Are there any subdomains?
  • Which domains, subdomains and site sections are of interest?
  • Which HTML element is content best extracted from and for which page type?
  • Are there any explicit inclusions or exclusions in the linking logic?
  • How do we handle newly discovered URLs not in your sitemap?
  • What should we do with redirects and crawl errors?
  • What is the maximum crawl rate for your site?
  • Is your website JavaScript-based?

The list goes on, so careful consideration of your specific setup and objectives is essential in preparation for crawling, content extraction and link recommendations.

Crawling

We typically start from your sitemap files and use them as the starting point for URL discovery and crawl list generation. Additional URLs can be added from an arbitrary number of sources, including programmatic generation. A user-agent-based crawler then proceeds as either a single-threaded or multi-threaded process, as gently or as aggressively as we design it to be.

Raw HTML data is saved during the crawl awaiting further processing.
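
As a rough illustration of crawl list generation, the sketch below reads URLs from a sitemap document and merges them with URLs from other sources, dropping duplicates. It is a minimal sketch rather than the actual ILO crawler: the function names, the example.com URLs and the inline sitemap are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return the <loc> values of a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def build_crawl_list(sitemap_xml, extra_urls=()):
    """Merge sitemap URLs with URLs from other sources, dropping duplicates."""
    seen = set()
    crawl_list = []
    for url in urls_from_sitemap(sitemap_xml) + list(extra_urls):
        if url not in seen:
            seen.add(url)
            crawl_list.append(url)
    return crawl_list


# Hypothetical sitemap used only for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>"""

crawl_list = build_crawl_list(
    EXAMPLE_SITEMAP,
    extra_urls=["https://example.com/blog/", "https://example.com/new-page/"],
)
print(crawl_list)
```

A real crawl would also need to respect the agreed crawl rate and handle redirects and errors, which this sketch leaves out.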

Data Extraction

Raw HTML is then processed to extract meaningful and clean data. This typically involves extracting text from the article or body element, or from custom ids and classes, taking care to exclude boilerplate elements such as nav, sidebar and footer.
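
The idea of excluding boilerplate can be sketched with Python's standard-library HTML parser. This is an illustrative example only, not the extraction logic used in the pipeline: the set of skipped tags (with `aside` standing in for a sidebar) and the class and function names are assumptions.

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (an assumption for this sketch).
SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many boilerplate tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_clean_text(html):
    """Return the page text with boilerplate sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample_html = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
clean_text = extract_clean_text(sample_html)
print(clean_text)
```

In practice the extraction target differs per page type, so the tag list would be replaced by whatever selectors are agreed during system design.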

The system also finds and maps all your internal links, generates a link graph and calculates internal PageRank. This provides a more nuanced insight into internal page connectivity, and it can further inform the link recommendation strategy and fine-tune the final output. Internal page authority is also useful for creating before-and-after optimisation projections and linking them to your business goals.

The final outcome of data extraction is:

  • Clean text
  • Page metadata
  • Internal link graph, anchor text and PageRank
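
The internal PageRank calculation mentioned above can be sketched as a simple power iteration over the link graph. This is a textbook-style sketch, not the pipeline's implementation: the damping factor of 0.85, the iteration count and the toy graph are assumptions.

```python
def internal_pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a link graph.

    links maps each URL to the list of internal URLs it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same base share from the random jump.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: the home page links to two pages, which both link back.
toy_graph = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/"]}
ranks = internal_pagerank(toy_graph)
print(ranks)
```

In this toy graph the home page ends up with the highest score because both other pages link to it.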

BERT Vector Embeddings

Pre-processed and tokenised text is then converted to multi-dimensional vectors as language-agnostic BERT sentence embeddings. This enables similarity searches among pages in any major human language, making the system suitable for multilingual websites with complex alternate hreflang setups.

Similarity Search

In this stage, we use cosine similarity to generate a similarity matrix, from which we can produce an arbitrary number of link recommendations in a many-to-many scenario. This maps all similar pages in the entire dataset.
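
To make the cosine similarity step concrete, here is a minimal sketch that scores every pair of pages and keeps the top matches for each one. The three-dimensional vectors and URLs are made up for illustration (real sentence embeddings have hundreds of dimensions), and the function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar(embeddings, k=2):
    """For each URL, return its k most similar other URLs with their scores."""
    recommendations = {}
    for url, vector in embeddings.items():
        scored = [
            (other, cosine_similarity(vector, other_vector))
            for other, other_vector in embeddings.items()
            if other != url
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        recommendations[url] = scored[:k]
    return recommendations


# Toy embeddings, invented for this example.
toy_embeddings = {
    "/running-shoes": [0.9, 0.1, 0.0],
    "/trail-shoes": [0.8, 0.2, 0.1],
    "/coffee-makers": [0.0, 0.1, 0.9],
}
recommendations = top_similar(toy_embeddings, k=1)
print(recommendations)
```

This loop compares every pair of pages, so it only illustrates the idea; it would not scale to the page counts discussed earlier without a more efficient search method.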

Prioritisation

Your similarity matrix holds many thousands, even millions, of link recommendations, so it’s important to prioritise the rollout carefully: use heuristics to find high-impact link recommendations and filter out the rest.

Example rules:

  • Only link from higher to lower PageRank URLs
  • Exclude pages which already link to each other
  • Prevent linking between pages that are too far apart in authority
  • Suggest links only between certain topical clusters

Note: We do not recommend automatic linking; however, LLM-based automated evaluation is an option.

Final Recommendations

You receive a spreadsheet of all link recommendations, prioritised and sorted for implementation. Our work isn’t finished at this stage, however: we stay with you, offering advice, guidance and assistance during implementation, and help you measure and report on the impact of your internal link optimisation project.