I am looking for general opinions on whether running polling engines on blade servers is a good idea or not.
We have been running pollers on both HP stand-alone servers and HP blade servers with no issues. Our SQL DB is on an HP blade server with SAN disk. I have done LOTS of performance monitoring of the SQL Server hardware, especially disk I/O, which is often where a SQL bottleneck truly is. I've got our database running on several different LUNs on the SAN to optimize performance: one LUN for the main SQL install, another LUN for the main orionslx database file, another for tempdb, another for the transaction log, and a final LUN where I separated out the orionslx filegroups for our NetFlow data (orionslx1-4.mdf). There was a HUGE speed increase when separating out these NetFlow filegroups. Each SAN LUN was set up as RAID 1+0, and each one was presented to the machine as a different drive letter, of course.
I was using perfmon and watching Current Disk Queue Length and % Disk Time. All the additional LUNs created a very noticeable performance increase.
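For anyone who'd rather script this than watch the perfmon GUI, here's a rough Python sketch that parses the CSV output of Windows' typeperf tool for the same two counters. The counter paths and the sample output line are illustrative (instance names and values will differ on your machine):

```python
import csv
import io

# The counters discussed above; "_Total" can be swapped for a specific
# PhysicalDisk instance (e.g. the LUN holding tempdb).
COUNTERS = [
    r"\PhysicalDisk(_Total)\Current Disk Queue Length",
    r"\PhysicalDisk(_Total)\% Disk Time",
]

def parse_typeperf_csv(text):
    """Parse typeperf PDH-CSV output: the first row is headers,
    remaining rows are a timestamp plus one value per counter."""
    rows = list(csv.reader(io.StringIO(text)))
    headers = rows[0][1:]  # drop the timestamp column
    samples = []
    for row in rows[1:]:
        values = [float(v) for v in row[1:]]
        samples.append(dict(zip(headers, values)))
    return samples

# Illustrative sample of what `typeperf -sc 1 <counters>` emits (made-up values):
sample = (
    '"(PDH-CSV 4.0)","\\\\DBSERVER\\PhysicalDisk(_Total)\\Current Disk Queue Length",'
    '"\\\\DBSERVER\\PhysicalDisk(_Total)\\% Disk Time"\n'
    '"04/12/2010 09:00:01.000","3.000000","87.500000"\n'
)

for s in parse_typeperf_csv(sample):
    for counter, value in s.items():
        print(f"{counter} = {value}")
```

A sustained Current Disk Queue Length well above the number of spindles in the LUN is the classic sign of the I/O bottleneck described above.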
Hope this helps.
I have been running Orion NPM/NCM on HP Blades for 2 years now, and I have never had a problem.
One thing you will need to consider is how much space you need for the database, and whether your blade platform offers fast enough disk I/O for your environment. Since I run on HP blades, I only have two HDs available -- this hasn't been a problem for me since I poll < 1000 elements. HP does offer storage blades that act as direct-attached storage, but I haven't had a need. Even if your database is massive, or you have an existing SAN infrastructure you'd like to use for it, most (if not all) blade vendors offer hardware to integrate into your environment -- iSCSI, FC, etc.
If you do use HP blades, HP sells Fusion-io drives (branded as the HP IO Accelerator) that offer 80 GB, 160 GB, and I think 320 GB of SSD storage in the HP mezzanine card format. I've used the IO Accelerator for other applications and it works great.
No worries there. I already have a very large Orion implementation and am looking to install a 3rd polling engine. My boss asked about the possibility of moving to blade servers for just the polling engines, hence the question. I have a stand-alone server for the DB, with the actual DB living on a very speedy SAN.
I have a large Orion deployment: Orion Server, HSB engine, 9 Polling Engines, and a monster DB server all are on blade servers. Been running there for the last three years. No issues
All are HP Blade Servers - E5450 (2HTx3.00 GHz) with 16 GB RAM; same hardware for the DB, but with 32 GB RAM. The NICs are teamed and set to fault-tolerant. Nothing fancy.
Hi jtimes,
I am quite interested in knowing more about the sizing of your setup with blade servers. I am hesitating whether to go for HP blades instead of an HP ProLiant DL360, which supports up to 6 disk spindles for the DB server.
My setup is 6000 elements with 50 NTA (NetFlow) interfaces and default polling intervals, with separate Orion NPM/NTA and DB servers.
What are the exact specs, especially the disks, of your blade DB server? How many disks? RAID 1+0? Disks with an accelerator or not? What model/type of disks are used for the DB? Is disk I/O OK?
Thanks
I am impressed and interested in your setup!
Can you tell us all what your monitored-elements breakdown looks like, what (if any) additional SW modules you are using, and what kind of stuff you are monitoring?
Thanks for sharing what you can!
The DB server is a "standard" HP blade with two physical drives. The first drive is partitioned into two (one for the OS, one for apps); the second drive only has NetPerfMon.db, the system DB files, and the page file. The DBs get backed up daily. NetPerfMon is around 50 GB. SQL 2005 SP3, all tweaked out, with 27 GB of memory dedicated to SQL.
Nightly maint takes roughly 45-60 minutes.
Retaining Detailed for 7
Retaining Hourly for 31
Retaining Daily for 90
Retaining Events for 32
*Retain Syslog for 32 days - not listed in swdebugMaintenance.log
Number of monitored elements breakdown:
Network Elements: 160771
Nodes: 7300
Interfaces: 152480
Volumes: 991
Polling intervals:
Check response time and status of each node every 240 seconds
Check the status of each interface every 160 seconds
Check the status of each volume every 520 seconds
Rediscover each Node, Interface, and Volume every 60 minutes
Statistics:
Collect statistics from each Node every 10 minutes
Collect statistics from each Interface every 5 minutes
Collect statistics from each Volume every 15 minutes
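Out of curiosity, those element counts and intervals work out to a pretty impressive aggregate polling rate. A quick back-of-the-envelope Python sketch, using only the numbers quoted above:

```python
# Element counts and polling/statistics intervals quoted in the post above.
nodes, interfaces, volumes = 7300, 152480, 991

# Status polls: node every 240 s, interface every 160 s, volume every 520 s.
status_rate = nodes / 240 + interfaces / 160 + volumes / 520

# Statistics collection: node every 10 min, interface every 5 min,
# volume every 15 min (expressed in seconds).
stats_rate = nodes / 600 + interfaces / 300 + volumes / 900

print(f"status polls/sec:     {status_rate:.0f}")   # -> 985
print(f"statistics polls/sec: {stats_rate:.0f}")    # -> 522
print(f"total ops/sec:        {status_rate + stats_rate:.0f}")  # -> 1507
```

Roughly 1500 polling operations per second in aggregate, which makes the 9 polling engines plus the main server sound entirely reasonable.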
I can't post the numbers for each device type, but here are the various manufacturers: 3Com, Adtran, Avaya, Cisco, F5 Labs, HP, IBM, net-snmp Linux, Nokia, Sun, Tandberg, Visual Networks, Windows.
Lots of UnDPs (Universal Device Pollers) for the F5s and Nokias.
200+ clients using Orion 24x7
Orion NPM is the only module we run.
Still on 9.0 SP2, because of all the customization re-work -- oh, and having to do each upgrade action 9 times...
Your NetPerfMon DB lives on a single drive? As in, 1 spindle? During a shutdown/restart of all your engines, what is the average disk queue length on that drive? Are you running 64-bit SQL, or 32-bit and utilizing AWE to be able to allocate the 27 GB of RAM to SQL?
Thanks so much for this information. Are the 200+ clients all accessing the Orion web interface?
timf, that helps a lot. Thanks. I was particularly curious about running pollers in a mixed environment of standalone servers and blades.
Sounds like you went above and beyond with optimizing your DB. I moved my DB to a large EVA SAN and, using AWE with 32-bit SQL, bumped available memory to 6 GB. Those two things are what did the trick for us in removing the I/O bottleneck. If we opt to run NTA again in the future, I may ping you and ask how you went about splitting out the NetFlow data. Sounds interesting.
Kiwi,
I have an SLX machine running the website, APM, and IP SLA. This is a 2x quad-core with 16 GB of RAM -- a bit of overkill. It polls 3500 elements, ~40 IP SLA operations, 660 APM monitors, ~200 UnDPs, and a LOT of traps.
Next I have a poller on physical hardware, 1x quad-core with 8 GB of RAM. It polls only 3 nodes, but totals 10,600 elements across those 3 nodes.
The last poller is a virtual machine; we give it 2x 3 GHz cores and 2 GB of RAM. It polls 7070 elements.
Our DB server is physical hardware: 2x quad-core with 16 GB of RAM, and 6x 146 GB 15k RPM disks in (unfortunately) RAID 6. 64-bit OS and 64-bit SQL.
For polling, the defaults are 300s/300s/300s with a rediscovery interval of 1440 minutes. However, many devices and interfaces are more critical and are set to 120s instead.
Statistics are similar, 10m/9m/15m, although many interfaces collect every minute.
Retention is 35d/35d/365d/30d. I keep 35 days of detailed statistics for monthly 95th-percentile compositions.
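For anyone doing the same kind of monthly composition, the usual billing-style 95th percentile (sort the samples, discard the top 5%) is simple to compute. A minimal sketch, assuming you've already pulled the per-interval utilization samples out of the database (the sample list here is just toy data):

```python
import math

def percentile_95(samples):
    """Billing-style 95th percentile: sort ascending and take the value
    at rank ceil(0.95 * n), i.e. discard the top 5% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    idx = math.ceil(0.95 * len(ordered)) - 1  # 0-based index of the 95th rank
    return ordered[idx]

# Toy utilization samples (%); a month of 5-minute polls would be ~8640 values.
utilization = [5, 7, 3, 98, 6, 4, 8, 5, 95, 6, 7, 5, 4, 6, 5, 7, 6, 5, 4, 6]
print(percentile_95(utilization))  # the two worst spikes mostly fall away
```

This is why 35 days of detailed retention matters -- you need every detailed sample for the month still in the DB when the composition runs.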
Manufacturers: Adtran, APC, Cisco, Dell, HP, Linux, Polycom, Vmware, Windows to name a few.
Fully patched across all systems for NPM, IP SLA, APM.
We have approximately 100 users, and APM tells me I average about 20 concurrent connections.