The IPv4 full BGP table size is at around 725000 prefix now. This may cause problems for companies who do not have the resources to update or upgrade their edge routers.
But, except for Internet transit providers, who does really need to get the full IPv4 BGP table today? And what are the alternatives? Let’s see that in details with some use-cases.
No BGP
First example, if my company only need Internet access, without providing any service, I do not need BGP at all. A default-route, pointing to my ISP is enough.
If I want more redundancy, then I can take two different ISP. But here also, BGP is not always needed:
If I want to make some load-sharing between the two links, I can do it with NAT for example: part of my outgoing traffic towards the Internet could be “NATed” with my ISP-A and another part with ISP-B. The return traffic will come back by the same way.
Now, what if my company have on-site servers, who need to be reachable from the Internet? Again, with a single ISP there is no problem, I can do this with NAT too.
But, with two or more uplinks, this is more complex. A smart combination of different DNS entries and NAT/PAT can be used. Or a load-balancing device with the different uplinks connected to it too. Or, we can also use dynamic DNS and NAT/PAT if the different ISP provides only dynamic IP addresses.
But here, we reach the limit of an acceptable design: the redundancy may be not always guaranteed if one ISP goes down. For instance, some part of the Internet could be able to reach my servers and some part not. And finally, the troubleshooting of a solution like this may be quite difficult.
Multi-homed BGP with default-routes
In the case my company have a public IP subnet, provider-independent or provided by one of the two ISP, and need at least two different ISP to achieve redundancy, the best way is to use BGP. But, nothing complex here: my company announce its IP prefix to both ISP, and both ISP announce a default-route to my company’s edge routers. Here’s a simple drawing:
In this situation, If the ISP-A fail, or the link to ISP-A fail, my company prefix is not announced any more to them (or by them, depending where is the failure). Then, the Internet traffic from/to “my_prefix” goes through ISP B.
This is a very common solution to achieve redundancy. The problem with this solution is the lack of control where the traffic from and to the Internet will go. In fact, the egress traffic from my company (my_prefix) to the Internet will go through only one ISP, except if you use bgp bestpath as-path multipath-relax with Cisco devices or an equivalent feature. I explained this feature in details into this post.
And the ingress traffic path will depend on both ISPs traffic engineering, traffic policies and connectivity to the rest of the Internet.
For example, if ISP-A is a tier-1, with an excellent connectivity to the world and ISP-B is a very local services provider with only one upstream provider, there is a good chance that most of the ingress traffic comes from ISP-A. Unless the clients of my company are all connected to ISP-B.
So in this case, the choice of the different ISPs is crucial. Not just the price but also the connectivity to my potential customers or partners, as well as the connectivity of the others ISP. Imagine if the ISP-B have only one upstream provider towards the Internet, and this is ISP-A …
Of course, we can influence the ingress traffic with different BGP attributes, like AS-path prepending for example, MED in the case of multiple links to the same ISP, or with egress communities tagging if the upstream provider is providing traffic-engineering possibilities. Other solutions could be announcing more specific prefixes to one or the other ISP (de-aggregation). Or, of course, with a combination of multiple of these parameters
But, we still have almost no control on the egress traffic. The options here are one ISP active / one ISP backup, or both active with a load-sharing method explained here.
Multi-homed BGP with full-routes
To have more control on the egress traffic, the default choice since many years is to build a BGP multi-homed network and getting the full BGP routing tables from the different upstream providers. Like this, we receive the entire routing table into the routers we control, and we can build our own traffic policy for the egress traffic.
The problem: BGP tables are growing
The problem with full BGP tables is the IPv4 and IPv6 BGP tables sizes are constantly increasing, and this have a significant impact on your routers.
The IPv4 space fragmentation is the direct repercussion of the IPv4 address exhaustion: we see more and more /24 subnets into the BGP tables.
The IPv6 table is growing more “naturally” – around 53000 at the time I wrote this article – following the deployment of new IPv6 prefixes across the world.
The problem of the limited FIB capacity and growing BGP tables size is not new; multiple outages were reported in 2008, when the BGP table size crossed the 256K limit, and again in 2014 when the 512K entries limit was exceeded. A good article here describes the problems experienced in 2014.
The 768K limit
In summary, the BGP advertisements received by a router are processed and inserted into a table called Forwarding Information Table (FIB). This FIB have a maximum limit of entries, determined by many factors like the amount of memory, the hardware (ASICs) and sometimes also the software. And not all vendors support dynamic allocation of FIB entries between route-types (IPv4, IPv6, MPLS). For example, a router may support 1 Million routes in total, but it is limited to only 756K prefixes for IPv4 and 256K for IPv6 prefixes.
As the IPv4 table is very close of the 756K, there is a good chance that the problems experienced in 2008 and 2014 will recur very soon.
Moreover, if we look at the extremely slow deployment of IPv6 across the world, the partitioning of the IPv4 table may increase again and again. At the end, the BGP IPv4 full-route solution explained above may be a very expensive solution. And how will you, as network engineer or IT manager, justify the cost of multiple upgrades of your edge routers, just because the Internet is growing?
Multi-homed BGP with partial-routes:
The good compromise
So, the question is, do we really need all the IPv4 Internet prefixes? Do we need to always use the shortest AS-path to reach one destination at the other side of the world? For many companies, the answer is clearly no!
A good compromise between the most important routes and the default route seems the most scalable solution. I personally implemented this solution to a customer few months ago. I will explain it below in details.
Customer use-case
My customer wanted to have a good global connectivity, an excellent local connectivity, and not spending too much money on edge routers.
My solution was this: take two links to tier-1 IP transit providers, both with an excellent global connectivity, and take two others links to two national ISPs, those with the most residential and business customers, to have the best connectivity within the country. Then, filter the prefixes and accept only the most interesting routes received by them, plus the default-routes from the two tier-1 ISP.
National connectivity
From the two national providers, no need for default and upstreams routes, I kept only the prefixes of the ISP and its customers.
One of the two is using community tagging to designate where the prefix come from (upstream, customer, himself) so for this one the filtering was easy.
For the second one, some reverse lookup on the RIPE database gave me all the needed information to build a good AS filter. Yes, I agree, this is a static filter for the moment, but I am looking for a more dynamic solution.
Global connectivity
For the two tier-1 ISP, we made some experiments with the customer and first decided to keep the provider’s own prefixes, plus the first AS behind. In addition to that, we keep also some specific AS, in relation to the business of this customer. A simple BGP AS reg-exp filter is used for this.
The results
The results were excellent! The IPv4 routing tables was composed of this:
- Between 2’000 and 5’000 IPv4 prefixes for the national providers.
- Around 125’000 IPv4 prefixes for the first tier-1.
- Around 250’000 IPv4 prefixes for the second tier-1.
So, a total of less than 400’000 IPv4 prefixes in total.
This is still a relatively high number, but we can still filter more prefixes depending on the customer needs. We just started a long-term traffic analysis, to fine-tune the AS filters.
The default routes
And about the prefixes we do not have in the routing table?
For theses, this is simple, we asked the two tier-1 ISP to also provides us a default-route, and we load-share the traffic between the two with the “bgp bestpath as-path multipath-relax” feature explained on this post.
In summary
- Regional egress traffic is going to two regional ISPs.
- Global egress traffic to the shortest AS-path tier-1 ISP.
- The rest of the egress traffic is load-shared between the two tier-1 ISP.
- The ingress traffic comes from the shortest AS-path of the 4 ISPs.
- If any of the ISP goes down, the redundancy is assured.
- We reduced the IPv4 table size by nearly 50% and we can do even better.
One last important point: when a router receives the full BGP table from an upstream, after the establishment or re-establishment of the session (after a hard reset), even if the router filter more than 50% of the full BGP table, the entire table is still processed by our router (all the BGP update messages) and then filtered. This must be considered when you do the sizing of the router.
ORF (outbound route-filtering) could be a very good solution to solve this, but I don’t know if any ISP agree to use this with one of his customer.
The other solutions could be to ask your upstream provider to announce only what you need, but this requires administrative work and delays when we need to change a filter.
Conclusion
With the actual growing of the BGP IPv4 routing table, partial routing is an excellent solution. More intelligent and sophisticated solutions in this area are also coming. For example, a SDN controller can make real-time analysis of the prefixes received by the peers, and dynamically inject only the most interesting into the edge routers.