Ellie Mae Encompass program

Renzore101 · Nov 29, 2016

Hello CF,

Does anyone have any experience troubleshooting the Encompass loan processing software? Tomorrow I'm taking a road trip to a branch office that is having some odd issues with the Encompass connection timing out. What I know at this point:

1) While I was remoted into end users' machines, the program's connection would time out; during that same time, however, Google still worked, I could ping the gateway and ping the WAN IP from my desk, GoToAssist still worked, etc.

2) This issue seems to occur more frequently when more users are accessing the network at the branch

3) The ISP has been dispatched repeatedly; they found a cabling issue with some sort of splitter and also replaced the modem

Tomorrow I am going to continue troubleshooting, however I am definitely open to any suggestions on what rabbit hole to crawl down next. At this point I'm thinking something is breaking a TCP connection that this program uses to talk to an external server.

Regards,

Bobby

beers · Nov 29, 2016

What's the specific topology between the server and the client? Keep in mind you would want to investigate every segment between the client and the server, including switch port metrics, router interfaces/mtu/nat/policies, VPN if applicable, etc. etc.

What do they have for WAN connectivity? Straight internet circuit or VPN overlay? If VPN, what kind of negotiation policy do you have (think timeout)? MPLS? What model of equipment? What's the latency between the client and the server? How much bandwidth is available (up and down)? Do you have any QoS policies applied? How about on the tunnel? Provider?

Have you pulled any packet captures? Where did you pull them from? Are they full of retransmits, large gaps between acknowledgements or other odd behavior? Do any other applications exhibit this behavior, perhaps someone mentioning 'product x is slow sometimes' or similar?

What troubleshooting steps have you taken currently? How many hops does it take to traverse the client to the destination? Can you observe any congestion along the selected path? It sounds like 'general internet' works okay, but it would help to get an overview of exactly where everything sits.

We have a crap product at work that's extremely latency sensitive by doing direct SQL calls over the WAN. The delay in all of the calls makes the app perform horribly around ~65ms+, not sure if a similar situation but a poorly coded application can be awful even on a decent performing network.
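To put rough numbers on the latency-sensitivity point above, here is a minimal sketch of how sequential round trips add up. The figure of 200 queries per screen load is purely hypothetical; nothing in this thread says how many round trips Encompass or any other product actually makes.

```python
def chatty_delay_seconds(round_trips: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the network when calls run one after another."""
    return round_trips * rtt_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical: one screen load issues 200 sequential queries.
    for rtt_ms in (1, 20, 65, 150):
        wait = chatty_delay_seconds(200, rtt_ms)
        print(f"{rtt_ms:>4} ms RTT -> {wait:.1f} s waiting on the network")
```

The takeaway is only that the wait scales linearly with RTT when calls are serialized, so an app that feels fine on a LAN can feel broken over a WAN without any packet loss at all.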

Renzore101 · Nov 29, 2016

beers said:
What's the specific topology between the server and the client? Keep in mind you would want to investigate every segment between the client and the server, including switch port metrics, router interfaces/mtu/nat/policies, VPN if applicable, etc. etc.

What do they have for WAN connectivity? Straight internet circuit or VPN overlay? If VPN, what kind of negotiation policy do you have (think timeout)? MPLS? What model of equipment? What's the latency between the client and the server? How much bandwidth is available (up and down)? Do you have any QoS policies applied? How about on the tunnel? Provider?

Have you pulled any packet captures? Where did you pull them from? Are they full of retransmits, large gaps between acknowledgements or other odd behavior? Do any other applications exhibit this behavior, perhaps someone mentioning 'product x is slow sometimes' or similar?

What troubleshooting steps have you taken currently? How many hops does it take to traverse the client to the destination? Can you observe any congestion along the selected path? It sounds like 'general internet' works okay, but it would help to get an overview of exactly where everything sits.

We have a crap product at work that's extremely latency sensitive by doing direct SQL calls over the WAN. The delay in all of the calls makes the app perform horribly around ~65ms+, not sure if a similar situation but a poorly coded application can be awful even on a decent performing network.

Beers, thanks a lot for this very thoughtful feedback; I will use it in my ongoing troubleshooting tomorrow. I am not the network admin, however I am working closely with him on this issue. I am also new to this organization, so I am not 100% positive how the network is architected. At this point what I do know is that we are using Cisco Meraki routers at the branches. The InfoSec analyst just recently configured a full-mesh VPN setup with most of the branch offices that currently have a Meraki installed, however I am not sure of the protocol specifics of the VPN. At this specific branch we have Comcast business internet service with a 75/15 package, I believe.

I am logged into the Meraki dashboard now taking a look at the network configuration settings and this is what I see:

1) We do have traffic shaping (QoS) configured on the branch router (3 rules: High, Normal, Low); in this list I see a public host IP, which I believe is used by this program, under the High category

2) Security appliance is using IPsec with AES site-to-site hub mesh VPN

3) <60ms latency, although I would see occasional spikes to 100-200ms roughly once every 100 pings

4) When running a continuous ping I would notice a single packet drop once every 200 pings or so
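Observations 3 and 4 are easier to compare across branches if they are reduced to the same few numbers each time. The following is a minimal sketch, assuming the ping results have already been collected into a list of RTTs in milliseconds with `None` standing in for a lost reply; the function name, the 100 ms spike threshold, and the sample data are all made up for illustration.

```python
from statistics import median


def summarize_pings(samples, spike_ms=100.0):
    """Summarize a ping run. `samples` holds RTTs in ms, None for a lost reply."""
    replies = [s for s in samples if s is not None]
    lost = len(samples) - len(replies)
    return {
        "sent": len(samples),
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
        "median_ms": median(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
        "spikes": sum(1 for r in replies if r >= spike_ms),
    }


if __name__ == "__main__":
    # Synthetic run shaped like the numbers above: 200 pings, 1 lost, 2 spikes.
    samples = [45.0] * 197 + [150.0, 180.0] + [None]
    print(summarize_pings(samples))
```

One drop in 200 pings works out to 0.5% loss, which is consistent with the "mostly under 1%" picture but says nothing about whether losses cluster during busy periods; keeping per-run summaries with timestamps would help answer that.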

Tomorrow when I get to the branch I am first going to be replacing the Meraki, as I already shipped one out. Once this is completed, all of our network gear at the branch will have been replaced. After this I intend to run a Wireshark capture on several of the client machines to see if I can catch this in action. Is there somewhere else that I could run the capture that may give me a clearer picture?

beers · Nov 29, 2016

Ah, cool. Are they MX64? I deployed our company's managed VPN appliance platform on Meraki so I have some familiarity with them.

Brag map: http://i.imgur.com/hNXaOxt.png

Do you see this spiking everywhere else on the WAN or just that destination? What kind of average utilization do you see up/down? Are you doing any shaping on the interface to match 75/15?

While the Meraki is pretty good at layer 7 traffic reporting, it presents its own issues trying to troubleshoot the internet and similar just based on the lack of tools and interface info they provide. You can set up your own custom ping monitors to a couple of internet destinations and observe a loss graph over time, although that's part of the newer firmware if you haven't upgraded any time soon.

I believe support also has a few additional throughput tools they can use between branches if you open a ticket. Just say you are having some problems between branches and are curious what you can achieve between the head end and client appliance (preferably at the edge of business hours so it doesn't impact your user base).

Renzore101 said:
replacing the Meraki

With another Meraki?

Renzore101 · Nov 29, 2016

Awesome, that sprawling Meraki network is probably more dependable than AT&T! I may be talking to the right person then. Yes, they are Meraki MX64W routers, and I am going to be swapping the MX64W with an identical model. To answer your question about the spiking, let me get back to you on that, as I'll have to run some more tests to some of the other branches.

The utilization is also an issue, as it is not reported as obviously as at my previous job. I was used to seeing Scrutinizer reports showing total bandwidth utilization on internet circuits as a percentage of capacity, however in this network the only real equivalent is an "Uplink" tab in the Meraki dashboard that is real time, which is tricky. In Scrutinizer I had the ability to view graphs that would show total circuit utilization over periods of time. In the Meraki dashboard, as I said, it is real time, so it's hard to accurately gauge if they are over-utilizing the circuit. I will however mention that this is an area I had been obsessively looking at, because I thought maybe Comcast could be throttling the branch during peak usage times; however, during the time-outs they were nowhere near 75Mb on the uplink traffic.

We also have a ping test to 8.8.8.8. I'm guessing this is the data behind the latency and loss graphs, which I can blow out to the last month; they show <60ms latency and mostly <1% loss.

beers · Nov 29, 2016

Renzore101 said:
"Uplink" tab in the Meraki dashboard that is real time

There are a few spots where it will give you historical layer 7 data. Network -> Summary Report is a good one to easily see up vs downstream. You can also break it down per client on Network -> Clients to see individual host utilization or Network -> Traffic Analytics will give you similar stats with some additional graphs and data.

Renzore101 · Nov 29, 2016

beers said:
There are a few spots where it will give you historical layer 7 data. Network -> Summary Report is a good one to easily see up vs downstream. You can also break it down per client on Network -> Clients to see individual host utilization or Network -> Traffic Analytics will give you similar stats with some additional graphs and data.

Thanks for pointing me in this direction, I'm combing through these menus now trying to make sense of it all. I am noticing a trend at this branch and a few others of abnormally high max RTT (ms). I'm making note of all branch uplink details, and then I'm going to run through and tally up the branches with the higher average client counts. I am beginning to wonder if this could be an issue at other locations too, just more readily apparent at this location because it sees more traffic. The loss rate never appears to be at or over 1% in the records that I've seen thus far. Do you think RTT spikes could have something to do with some sort of TCP timing issue?
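One way to reason about the RTT-spike question is to look at how a TCP sender derives its retransmission timeout (RTO) from measured RTTs. The sketch below follows the smoothing formulas in RFC 6298 (alpha = 1/8, beta = 1/4, K = 4). It is only an illustration under assumptions: the 1000 ms floor is the RFC's recommended minimum, real stacks often use a lower one, clock granularity is ignored, and the sample RTTs are invented. It also says nothing about any application-level timeout inside Encompass, which is not known here.

```python
def rto_trace(rtt_samples_ms, min_rto_ms=1000.0, alpha=1 / 8, beta=1 / 4, k=4):
    """Return (sample, srtt, rto) tuples using RFC 6298-style smoothing."""
    srtt = None
    rttvar = None
    trace = []
    for r in rtt_samples_ms:
        if srtt is None:
            # First measurement seeds both estimators.
            srtt = r
            rttvar = r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
        rto = max(min_rto_ms, srtt + k * rttvar)
        trace.append((r, srtt, rto))
    return trace


if __name__ == "__main__":
    # Invented samples: steady ~60 ms with one 200 ms spike.
    samples = [60.0] * 10 + [200.0] + [60.0] * 5
    for floor in (1000.0, 200.0):
        print(f"min RTO {floor:.0f} ms")
        for sample, srtt, rto in rto_trace(samples, min_rto_ms=floor):
            print(f"  rtt={sample:6.1f}  srtt={srtt:6.1f}  rto={rto:7.1f}")
```

Comparing each sample against the RTO computed just before it shows whether a spike of that size could plausibly trigger a timeout-based retransmit under the assumed floor. A packet capture would still be needed to see what the real connections are doing.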
