All times are UTC

 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Thu Feb 27, 2014 9:25 pm
Joined: Wed Feb 16, 2011 5:07 pm
Posts: 114
it seems the issue i reported might be somehow related...

when the traffic stopped passing altogether, i looked at the interface and noticed output drops too:

Code:
R2#sh int fa0/1
FastEthernet0/1 is up, line protocol is up
  Hardware is Gt96k FE, address is c201.1d80.0001 (bia c201.1d80.0001)
  Internet address is 10.0.1.2/24
  MTU 1500 bytes, BW 100000 Kbit/sec, DLY 100 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 100Mb/s, 100BaseTX/FX
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:07, output 00:00:08, output hang never
  Last clearing of "show interface" counters 00:00:09
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 6
  Queueing strategy: fifo
  Output queue: 40/40 (size/max)


here i went and "hacked" it by configuring the highest tx-ring-limit as a workaround; packets eventually start flowing, but later on they are tail-dropped in the output queue as well...

i'll try to see if i can compile dynamips with your patch, hopefully that solves my problems too




 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Thu Feb 27, 2014 9:50 pm
Joined: Wed Feb 16, 2011 5:07 pm
Posts: 114
bad news... the patch still doesn't fix my issue... but it might be somehow still related! :(


 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Fri Feb 28, 2014 9:09 am
Joined: Sat Feb 22, 2014 9:47 pm
Posts: 11
Hi,

Do you think you could provide us with the exact topology and the configurations you are using, along with a guide to reproducing your issue? I cannot promise I will be able to find a solution - I am generally a bad programmer - but I may at least try.

By the way, I've sent you an e-mail. Looking forward to hearing from you soon.

Best regards,
Peter


 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Fri Feb 28, 2014 9:45 am
Joined: Wed Feb 16, 2011 5:07 pm
Posts: 114
this is the lab:

https://dl.dropboxusercontent.com/u/665 ... ticast.zip

topology is simple:

R1 --- R2 --- R3

R3 has a loopback and has joined group 239.1.1.1; all interfaces are running PIM dense mode.

all routers are running static routing to keep it as simple as possible ...

config is :

Code:
ip multicast-routing
!
interface FastEthernet 0/0
ip pim dense-mode


from R1 just do multicast pings to 239.1.1.1 (i do about 1000 so they keep running); after about 40-50 pings, all traffic stops flowing ... the culprit is either the link between R1 and R2 or the one between R2 and R3 ...

where the traffic stops flowing you'll see tail drops because the output queue of the interface is full..

p.s. Yes, I live in Brno but I am not Czech .. Czech is a very, very hard language, so maybe it's better to speak English :D (p.s. I am Italian)


 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Sat Mar 01, 2014 12:40 am
Joined: Sun Feb 27, 2011 1:53 am
Posts: 32
I am experiencing the same issue as anubisg1, even with the patch provided by paluchpeter. After some troubleshooting I have discovered that the next-hop router's outgoing interface ALWAYS starts queueing traffic after 50+ continuous pings.

The issue below is demonstrated using the topology and config provided by anubisg1.

Code:
R1#ping 239.1.1.1 repeat 100

Type escape sequence to abort.
Sending 50, 100-byte ICMP Echos to 239.1.1.1, timeout is 2 seconds:
..
Reply to request 2 from 10.0.1.1, 48 ms
Reply to request 3 from 10.0.1.1, 56 ms
Reply to request 4 from 10.0.1.1, 52 ms
Reply to request 5 from 10.0.1.1, 60 ms

Reply to request x from 10.0.1.1, 60 ms

Reply to request 69 from 10.0.1.1, 60 ms.....


And then, when you check R2's f0/1 interface that is connected to R3, you will see that the number of packets in the output hold queue keeps increasing.

Code:
R2#sh interfaces f0/1 | inc drop|qu
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Output queue: 6/40 (size/max)
     2 unknown protocol drops
R2#sh interfaces f0/1 su
R2#sh interfaces f0/1 summary

*: interface is up
IHQ: pkts in input hold queue     IQD: pkts dropped from input queue
OHQ: pkts in output hold queue    OQD: pkts dropped from output queue
RXBS: rx rate (bits/sec)          RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec)          TXPS: tx rate (pkts/sec)
TRTL: throttle count

  Interface              IHQ   IQD  OHQ   OQD  RXBS RXPS  TXBS TXPS TRTL
------------------------------------------------------------------------
* FastEthernet0/1          0     0    8     0     0    0     0    0    0
R2#sh interfaces f0/1 summary

*: interface is up
IHQ: pkts in input hold queue     IQD: pkts dropped from input queue
OHQ: pkts in output hold queue    OQD: pkts dropped from output queue
RXBS: rx rate (bits/sec)          RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec)          TXPS: tx rate (pkts/sec)
TRTL: throttle count

  Interface              IHQ   IQD  OHQ   OQD  RXBS RXPS  TXBS TXPS TRTL
------------------------------------------------------------------------
* FastEthernet0/1          0     0   10     0     0    0     0    0    0
R2#



When the number of queued packets reaches 40, further packets are dropped from the output queue.

Code:


R2#sh interfaces f0/1 summary

*: interface is up
IHQ: pkts in input hold queue     IQD: pkts dropped from input queue
OHQ: pkts in output hold queue    OQD: pkts dropped from output queue
RXBS: rx rate (bits/sec)          RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec)          TXPS: tx rate (pkts/sec)
TRTL: throttle count

  Interface              IHQ   IQD  OHQ   OQD  RXBS RXPS  TXBS TXPS TRTL
------------------------------------------------------------------------
* FastEthernet0/1          0     0   40    39     0    0     0    0    0
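
The final counters above (OHQ 40, OQD 39) are what plain tail drop on a 40-slot queue looks like when nothing is ever taken off it. Below is a minimal sketch of that behaviour, not IOS code: the struct and the hq_* function names are made up for illustration, and the only number taken from the output is the queue limit of 40.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an interface output hold queue (not IOS code). */
struct hold_queue {
    size_t size;   /* packets currently queued (OHQ in the summary) */
    size_t max;    /* queue limit, 40 in the output above */
    size_t drops;  /* packets tail-dropped so far (OQD in the summary) */
};

/* Offer one packet; returns 1 if it was queued, 0 if it was tail-dropped. */
static int hq_enqueue(struct hold_queue *q)
{
    assert(q != NULL && q->size <= q->max);
    if (q->size == q->max) {
        q->drops++;          /* queue full: the new packet is discarded */
        return 0;
    }
    q->size++;
    return 1;
}

/* The transmitter removes one packet; returns 0 if the queue was empty. */
static int hq_dequeue(struct hold_queue *q)
{
    assert(q != NULL);
    if (q->size == 0)
        return 0;
    q->size--;
    return 1;
}

/* Offer n packets to a queue whose transmitter never drains it. */
static void hq_offer_stalled(struct hold_queue *q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        hq_enqueue(q);
}
```

Starting from an empty queue with max = 40, calling hq_offer_stalled() with 79 packets leaves size at 40 and drops at 39. The point of the sketch is that the drops are only a symptom: the real question is why nothing is being dequeued on the transmit side.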


After applying fair-queue, R2 starts forwarding multicast traffic again, and then after a few packets it starts to queue them again.

Code:
R2(config)#interface f0/1
R2(config-if)#fair-queue



Code:
R1#ping 239.1.1.1 repeat 100

Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.1.1.1, timeout is 2 seconds:
..................................................
Reply to request 50 from 10.0.1.1, 80 ms
Reply to request 51 from 10.0.1.1, 56 ms
Reply to request 52 from 10.0.1.1, 64 ms
Reply to request 53 from 10.0.1.1, 60 ms
Reply to request 54 from 10.0.1.1, 76 ms
Reply to request 55 from 10.0.1.1, 64 ms..............
.........................



If you remove the fair-queue command

Code:
R2(config)#interface f0/1
R2(config-if)#no fair-queue


The size of the output queue will be zero and R2 will start to forward all multicast traffic again. Then after 59 pings it will stop. I tried to increase the number of dynamic queues, but that did not help.

Code:
R2#sh interfaces f0/1 | inc drop|qu
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 88
  Output queue: 0/40 (size/max)
     2 unknown protocol drops



Code:

Reply to request 58 from 10.0.1.1, 56 ms
Reply to request 59 from 10.0.1.1, 60 ms..........
.................






R1 F0/0 <- Gt96k FE -> F0/0 R2 F0/1 <- Gt96k FE -> F0/0 R3 - packets queued and dropped on R2 F0/1

R1 F0/0 <- Gt96k FE -> F0/0 R2 S0/0 <- GT96K Serial -> S0/0 R3 - no issues

R1 F0/0 <- Gt96k FE -> F0/0 R2 F1/0 <- AmdFE -> F1/0 R3 - no issues

R1 F0/0 <- Gt96k FE -> F0/0 R2 F2/0 <- NM-16ESW-> F2/0 R3 - Terrible!!! Almost every second packet is dropped. R3 is receiving the multicast traffic and replying, but the packets are dropped between R3 and R2. "sh interfaces f2/0 | inc drop|qu" shows zero (0) drops.



Code:
R2#sh interfaces s0/0 | inc drop|qu
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Output queue: 0/1000/64/0 (size/max total/threshold/drops)


I tried to increase the number of queues to match the GT96K Serial, but that did not resolve the issue.

Code:
fair-queue 64 64 1000



After 20 pings packets started to queue

Code:
R2(config-if)#do sh int f0/1 | inc drop|qu
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Output queue: 17/1000/64/0 (size/max total/threshold/drops)
     0 unknown protocol drops



  Interface              IHQ   IQD  OHQ   OQD  RXBS RXPS  TXBS TXPS TRTL
------------------------------------------------------------------------
* FastEthernet0/1          0     0   30     0     0    0     0    0    0
R2#


 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Sun Mar 02, 2014 4:26 pm
Joined: Sat Feb 22, 2014 9:47 pm
Posts: 11
Dear friends,

Please find attached another patch that appears to finally rectify the problem observed with packet losses. This patch must be applied after the first patch that corrects the packet looping problem if multicasts are enabled, i.e. both my previous and this patch must be applied to completely solve the issue.

When anubisg1 and NetAdmin reported their problems, I've replicated their topology. It turned out that their problems are not caused by any sort of packet looping - quite the contrary, once the R2's Fa0/x interface towards R3 started showing a non-empty Output queue, packets stopped being forwarded completely. I have therefore focused on the gt_eth_handle_port_txqueue() function in common/dev_gt.c that is responsible for sending frames.

To understand the underlying cause of the issue this patch solves, I first need to spend a few words on how the Gt96k sends frames. A frame to be transmitted is stored in the router's memory in one or more buffers at most 64K large, and for each buffer, there is a so-called Transmit Descriptor (TxD), a 16B-long structure also stored in the router's memory that tells the Gt96k how to send the frame (whether the TxD is ready to be processed, whether this is the first or last TxD in the chain, etc.), where to find the data block, what its size is, and where the next TxD is. These TxDs form a linearly chained structure and it is the responsibility of IOS to construct them before calling the Gt96k to send a frame. The first 4B of each TxD is a command/status longword, and its most significant bit is the Own bit. If this bit is set, the TxD can be used to access data to be transmitted (it is "owned" by the Gt96k). Once IOS has set this Own bit to 1 and triggered an interrupt to the Gt96k to start the sending procedure, it must not touch the TxD until its Own bit is cleared, because the TxD chain is currently being processed by the Gt96k. Once the Gt96k processes the TxD, fetches the associated buffer and appends it to the frame being constructed, it clears the Own bit in the TxD and moves to the next TxD in the chain if this was not the last TxD.
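
As a rough illustration of the layout just described, here is a minimal sketch (it is not dynamips code). Only the 16-byte size, the command/status longword coming first and the Own bit being its most significant bit come from the description above. The order of the remaining fields, the First/Last bit positions, and the use of a ring index instead of a physical address for the next-TxD link are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXD_OWN   0x80000000u  /* MSB of command/status: owned by the Gt96k */
#define TXD_FIRST 0x00020000u  /* first TxD of a frame (bit position assumed) */
#define TXD_LAST  0x00010000u  /* last TxD of a frame (bit position assumed) */

/* One Transmit Descriptor; field order after cmd_stat is an assumption. */
struct txd {
    uint32_t cmd_stat;  /* command/status longword */
    uint32_t buf_size;  /* number of bytes in the data buffer */
    uint32_t buf_ptr;   /* where the data buffer lives */
    uint32_t next_idx;  /* next TxD in the chain (ring index in this sketch) */
};

_Static_assert(sizeof(struct txd) == 16, "a TxD is described as 16 bytes long");

/*
 * Walk the chain starting at ring[first] and add up the buffer sizes.
 * Returns the frame length, or -1 if a TxD in the chain is not owned by
 * the device or no Last TxD is found within ring_len steps.
 */
static long txd_frame_length(const struct txd *ring, size_t ring_len,
                             uint32_t first)
{
    long total = 0;
    uint32_t idx = first;

    assert(ring != NULL);
    for (size_t steps = 0; steps < ring_len; steps++) {
        if (idx >= ring_len)
            return -1;
        const struct txd *d = &ring[idx];
        if (!(d->cmd_stat & TXD_OWN))
            return -1;                 /* not handed over to the device */
        total += (long)d->buf_size;
        if (d->cmd_stat & TXD_LAST)
            return total;              /* end of this frame's chain */
        idx = d->next_idx;
    }
    return -1;
}
```

For a frame split over two descriptors, say a 14-byte buffer marked First followed by a 100-byte buffer marked Last, txd_frame_length() returns 114; a single descriptor marked both First and Last simply returns its own buffer size.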

It turned out that the function had a couple of issues in mimicking the Gt96k's behavior:

  • If a TxD is fetched by Gt96k whose Own bit is cleared, it means that the TxD is not yet ready to control a frame transmission. In this case, the Gt96k should stop transmitting, emit an interrupt and set up the interrupt cause register properly. This was not done - the function simply returned FALSE. The attached patch corrects this.
  • Currently, the emulated Gt96k clears the Own bit in all but the first TxD, and the Own bit in the first TxD is cleared only after the frame has been transmitted. The datasheet specifically states on multiple occasions that the behavior should be reversed: all TxDs except the last should have the Own bit cleared as soon as they have been processed. The Own bit in the last TxD, however, should be cleared only after the entire frame has been transmitted, as the last TxD also carries the result status of the transmission in the command/status longword. The attached patch corrects that. As a side note, because the code maintained the first TxD and a pointer to it, which is not truly necessary according to the datasheet, I have removed the variables txd0, ptxd and tx_start, which are useless in the corrected code.
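
The two corrections in the list above can be sketched roughly as follows. This is a simplified model and not the patched gt_eth_handle_port_txqueue(): the ring is a plain array, the First/Last bit positions, INT_TX_STOPPED and the tx_report struct are hypothetical, and "sending" a frame is reduced to taking a snapshot of which TxDs are still owned at that moment.

```c
#include <assert.h>
#include <stdint.h>

#define TXD_OWN        0x80000000u  /* MSB of command/status */
#define TXD_FIRST      0x00020000u  /* bit position assumed */
#define TXD_LAST       0x00010000u  /* bit position assumed */
#define INT_TX_STOPPED 0x1u         /* hypothetical interrupt cause bit */
#define RING_SIZE      8u           /* ring length used by this sketch */

struct txd { uint32_t cmd_stat, buf_size, buf_ptr, next_idx; };

struct tx_report {
    uint32_t frame_len;      /* bytes gathered for the frame */
    uint32_t owned_at_send;  /* bitmask of ring slots still owned at send time */
};

static uint32_t owned_mask(const struct txd *ring)
{
    uint32_t mask = 0;
    for (uint32_t i = 0; i < RING_SIZE; i++)
        if (ring[i].cmd_stat & TXD_OWN)
            mask |= 1u << i;
    return mask;
}

/* Returns 1 if a frame was sent, 0 if transmission stopped. */
static int tx_process_frame(struct txd *ring, uint32_t *cur,
                            uint32_t *int_cause, struct tx_report *rep)
{
    uint32_t idx = *cur;

    rep->frame_len = 0;
    rep->owned_at_send = 0;
    for (uint32_t steps = 0; steps < RING_SIZE; steps++) {
        assert(idx < RING_SIZE);
        struct txd *d = &ring[idx];

        if (!(d->cmd_stat & TXD_OWN)) {
            /* TxD not ready: stop and record an interrupt cause
             * instead of silently returning. */
            *int_cause |= INT_TX_STOPPED;
            *cur = idx;
            return 0;
        }
        rep->frame_len += d->buf_size;
        if (d->cmd_stat & TXD_LAST) {
            rep->owned_at_send = owned_mask(ring); /* frame goes out here */
            d->cmd_stat &= ~TXD_OWN;  /* last TxD released only after send */
            *cur = d->next_idx;
            return 1;
        }
        d->cmd_stat &= ~TXD_OWN;      /* non-last TxD released right away */
        idx = d->next_idx;
    }
    *int_cause |= INT_TX_STOPPED;     /* sketch-only guard: no Last TxD found */
    return 0;
}
```

With a two-TxD frame in slots 0 and 1, owned_at_send comes back as 0x2: slot 0 has already been released when the frame goes out, while slot 1 (the last TxD) is still owned and is released only afterwards. A second call on an empty slot returns 0 with INT_TX_STOPPED set in int_cause.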

Surprisingly, neither of these corrections truly solved the issue although I consider them to be important because they make the emulated Gt96k behave more closely to the datasheet specification. I spent hours of debugging and looking into custom outputs I've programmed into dev_gt.c to see what is going on. The behavior was the same - my IOS was happy sending packets until the TxD pointer reached a specific value (IOS-dependent but for my particular IOS version, the "TxD pointer of death" value was 0xf5e5390). Once the IOS reached this TxD, it simply stopped calling the Gt96k with any other new packets. I tried many things, from changing the order in which the Own bit is set, through emitting a series of interrupts with different reason indications, to even trying to stop sending interrupts and modifying the Own bits in any way - nothing had an effect. The IOS simply sat there without calling the Gt96k to send another frame. I was really getting lost.

When I started comparing the old dynamips versions and the differences in their source files I've noticed that between 0.2.8-RC1 and 0.2.8-RC2, there was a correction in handling the TxD descriptors for the serial part of the Gt96k, and the change was of this form:

Code:
       if (!(ptxd->cmd_stat & GT_TXDESC_F)) {
          ptxd->cmd_stat &= ~GT_TXDESC_OWN;
-         physmem_copy_u32_to_vm(d->vm,tx_current,ptxd->cmd_stat);
+         physmem_copy_u32_to_vm(d->vm,tx_current+4,ptxd->cmd_stat);


The gt_eth_handle_port_txqueue() used the same approach so I had a look as well. This snippet of code is the handling of the Own bit - once the TxD is processed, the Own bit is cleared and the changed command/status longword is written back to the TxD. The interesting thing - and to this very moment I do not understand it! - is that while the TxD is fetched from router's memory using the tx_current pointer and the command/status longword is the first 4B of the TxD, storing back the command/status longword must be done using tx_current+4 pointer. Whether this is a memory alignment issue, or a quirk related to difference in endianness of ordinary PCs and routers, or if there is some other reason why storing a value that is in the beginning of a memory block must be shifted by 4 bytes from the address used to fetch that memory block - I do not know and I would be very thankful if someone cleared this for me.

In any case, it has turned out that the gt_eth_handle_port_txqueue() was missing this correction of +4 when storing back the command/status longword of all but the first TxD. This was probably the reason why unicast communication mostly appeared to work: unicast packets were shown to be transmitted using just a single TxD which happened to be first and last at the same time, and this one had the +4 correction applied correctly. However, multicast packets seem to be sent using two TxDs and two buffers - the first buffer contains the pre-built Ethernet header, the second buffer contains the multicast packet (obviously a result of the mroute fast switching cache, as the first packet was forwarded using a single TxD while all others were forwarded using two TxDs). Now because in the original code, all but the first TxD were stored back without the +4 correction, the command/status longword with the cleared Own bit was stored in the wrong place in memory. Hence, when the IOS looked at the Own bit of these TxDs, they were still marked as owned by the Gt96k. After a number of TxDs piled up in IOS with the Own bit stuck at 1, the IOS probably had trouble allocating new TxDs (or got annoyed at the number of occupied TxDs), and started queueing packets.
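
The effect described above can be reproduced in a small model. It rests on one assumption that the post does not confirm: that in the emulated memory image the command/status word sits at byte offset 4 of the 16-byte descriptor, so a store to offset 0 lands on a different word. CMD_STAT_OFFSET, the big-endian helpers and all function names are hypothetical and are not taken from dynamips.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXD_OWN         0x80000000u
#define TXD_SIZE        16u
#define CMD_STAT_OFFSET 4u  /* ASSUMPTION for this sketch only */

/* Big-endian 32-bit accessors on a byte array standing in for router memory. */
static uint32_t mem_read_u32(const uint8_t *mem, size_t addr)
{
    return ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
           ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
}

static void mem_write_u32(uint8_t *mem, size_t addr, uint32_t v)
{
    mem[addr]     = (uint8_t)(v >> 24);
    mem[addr + 1] = (uint8_t)(v >> 16);
    mem[addr + 2] = (uint8_t)(v >> 8);
    mem[addr + 3] = (uint8_t)v;
}

/* Device side: clear Own and store the word at txd_addr + store_offset. */
static void device_release_txd(uint8_t *mem, size_t txd_addr,
                               size_t store_offset)
{
    uint32_t cmd_stat = mem_read_u32(mem, txd_addr + CMD_STAT_OFFSET);
    cmd_stat &= ~TXD_OWN;
    mem_write_u32(mem, txd_addr + store_offset, cmd_stat);
}

/* Driver side: does the TxD still look owned by the device? */
static int driver_sees_owned(const uint8_t *mem, size_t txd_addr)
{
    return (mem_read_u32(mem, txd_addr + CMD_STAT_OFFSET) & TXD_OWN) != 0;
}

/* Hand n_txd descriptors to the device, release each one with the given
 * store offset, and count how many still look owned afterwards. */
static size_t count_stuck(uint8_t *mem, size_t n_txd, size_t store_offset)
{
    size_t stuck = 0;

    assert(mem != NULL);
    for (size_t i = 0; i < n_txd; i++) {
        size_t addr = i * TXD_SIZE;
        mem_write_u32(mem, addr + CMD_STAT_OFFSET, TXD_OWN);
        device_release_txd(mem, addr, store_offset);
        if (driver_sees_owned(mem, addr))
            stuck++;
    }
    return stuck;
}
```

In this model, releasing four descriptors with a store offset of 0 leaves all four looking owned (and overwrites the neighbouring word), while a store offset of 4 leaves none stuck. That mirrors the pile-up of TxDs described above, whatever the true reason for the +4 turns out to be.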

So the solution was basically to apply the +4 pointer correction when storing the command/status longword back to the router's memory. Once again - I do not know why fetching the value is fine with tx_current while storing it back requires tx_current+4 - and I sincerely ask anyone more acquainted with dynamips internals to explain this to me - but ... at the end of the day, it appears to solve the issue.

I could have gone with simply correcting the tx_current pointer by the offset of +4 and leave the sources otherwise unchanged. I decided to go with the improvements of the Gt96k emulation, though, because I believe that whatever changes are done to mimic the hardware more closely, they make the emulation more robust.

Please find the attached patch. It should be applied after first applying the first patch. Tests and feedback are most welcome!

Best regards,
Peter


Attachments:
File comment: Gt96k patch correcting the reporting of processed transmit descriptors back to IOS
dynamips-gt2.patch [3.83 KiB]

 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Sun Mar 02, 2014 7:31 pm
Joined: Wed Feb 16, 2011 5:07 pm
Posts: 114
just one simple thing to say... it WORKS! :D

i'm going to put those patches in my linux builds waiting for 0.2.12 ... i'll try to compile it also for windows, but my codeblocks is not happy with make files :((


 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Sun Mar 02, 2014 8:40 pm
Joined: Sun Feb 27, 2011 1:53 am
Posts: 32
I can compile it on Windows and test it once I get some free time later.


 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Mon Mar 03, 2014 6:21 am
Joined: Sun Feb 27, 2011 1:53 am
Posts: 32
Compiled on Windows 7 64Bit and CONFIRMED -> The patch is WORKING !!!! Thanks a lot paluchpeter !!!


 Post subject: Re: [PATCH] Fix for packet looping on platforms with Gt96k F
Posted: Mon Mar 03, 2014 6:42 am
Joined: Sun Feb 27, 2011 1:53 am
Posts: 32
I just noticed that after every 69 pings there will be a space. Don't know if this is an issue or not :) . See attached picture.

Code:

Reply to request 68 from 10.0.1.1, 124 ms
Reply to request 69 from 10.0.1.1, 88 ms
SPACE
Reply to request 70 from 10.0.1.1, 88 ms
Reply to request 71 from 10.0.1.1, 112 ms

Reply to request 138 from 10.0.1.1, 101 ms
Reply to request 139 from 10.0.1.1, 100 ms
SPACE
Reply to request 140 from 10.0.1.1, 72 ms
Reply to request 141 from 10.0.1.1, 90 ms

Reply to request 209 from 10.0.1.1, 120 ms
SPACE
Reply to request 210 from 10.0.1.1, 112 ms


Reply to request 279 from 10.0.1.1, 96 ms
SPACE
Reply to request 280 from 10.0.1.1, 140 ms





Attachments:
R1 ping space after every 69 replies.jpg [ 116.93 KiB ]