Archive for the ‘performance’ Category

Inside Horizon: interactive analysis at cloud scale

April 15th, 2011 | Ari Gesher

Late last year, we were honored to be invited to talk at Reflections|Projections, ACM@UIUC’s annual student-run computing conference. We decided to bring a talk about Horizon, our system for doing aggregate analysis and filtering across very large amounts of data. The video of the talk was posted a few weeks back on the conference website.

Horizon started as a research project and technology demonstrator built during Palantir’s Hack Week – a periodic innovation sprint that our engineering team uses to build brand new ideas from whole cloth. It was then used by the Center for Public Integrity in their Who’s Behind The Subprime Meltdown report. We produced a short video on the subject, Beyond the Cloud: Project Horizon, released on our analysis blog. Subsequently, it was folded into our product offering under the name Object Explorer.

In this hour-long talk, two of the engineers who built this technology tell the story of how Horizon came to be, explain how it works, and give a live demo of analysis on hundreds of millions of records in interactive time.

Palantir: search with a twist (part two: realtime indexing and security)

October 27th, 2009 | Ari Gesher


[A number of weeks ago, we published a post on the search technology used by Palantir. That post covered raising the memory efficiency of a couple of operations. This is part two of that series.]

The most familiar use of search engines is to index documents made available on the Internet via the hypertext transfer protocol. Forgotten names like AltaVista, names not-yet-really-learned like Bing, and, of course, Google come to mind.

This one, massive use case has a couple of properties that I’d like to highlight:

  • Asynchronous indexing and querying – web search engines tend to use crawlers and indexers to build up an index of the web. After each crawl is finished, the new index is brought online for use by the query engine.
  • Lack of access controls – all the data in the index is available to any query. In fact, most queries are (from the standpoint of the index) completely anonymous.

Palantir: not a web search engine

Search technology is just one part of what makes up a Palantir system. For us, it’s a way to quickly retrieve Palantir objects; it’s not the whole of the application.

I’d like to highlight a couple of differences from the web search engine case. A Palantir system needs the following properties:

  • Realtime indexing and querying – we need information to be available immediately as it changes in the system.
  • Leak-proof access controls – we need the search engine to help us make sure that we don’t have information leaking across access control boundaries.


Palantir: search with a twist (part one: memory efficiency)

August 13th, 2009 | Ari Gesher


A Palantir cluster integrates many pieces of proven technology. One of them is our customized version of the venerable Java search engine, Lucene. Search engine technology tends to be optimized for the common use case of indexing web documents (or similar information architectures), where each query has a few search terms and returns many, many documents as results. We want to leverage the inverted index capabilities of Lucene, but our data access patterns differ from the typical use case: we need things like pervasive range-querying, different types of relevance, and dynamic views of the data based on security constraints. So in building our data platform, we’ve run into some interesting challenges that are unusual in the information retrieval realm, specifically:

  1. Raising memory efficiency
  2. Real-time indexing
  3. Preventing information leaks across access boundaries in an efficient manner

I’ll cover (1) in this post and (2) and (3) in a later post, due out in about two weeks. (Note: part 2 is available here)


Bandwidth isn’t cheap. Disk isn’t cheap. CPU isn’t cheap.

May 22nd, 2009 | Bob McGrew


At Palantir, we work in Silicon Valley, read High Scalability, and think of web companies like Facebook and Google as our peers. Most of the time, this is exactly the right recipe for bringing disruptive innovation into the intelligence community. Sometimes, though, it’s misleading: when discussing a design decision, it’s received knowledge that “disk is cheap” or “CPU is cheap.” For a web company with a deployment in a commercial data center (or its own data center), this received knowledge is correct. But for a company that ships distributed systems instead of hosting them, and whose deployment environment is the kind of locked-down server room in which classified data can reside, these assumptions couldn’t be more false.

At Palantir, we are almost never able to host our customers’ data – typically, as the data is very sensitive, we are not even allowed to see it! Our customers’ highly sensitive data has to reside in a Sensitive Compartmented Information Facility, or SCIF – a building constructed to resist attempts to access the information within, whether through active or passive measures. The network inside a SCIF is physically separated – “airgapped” – from the public Internet. As the entire rationale for such facilities is to prevent information leakage, moving information into or out of one is a tightly regulated process, almost always requiring a human to be in the loop.

Oracle’s JDBC driver + prefetch == garbage [collection]

February 23rd, 2007 | Ryan Porter

The Problem

Recently, we were experiencing major performance problems with loading documents from the database. CPU profiling did not isolate a single cause; everything (including unrelated background operations) seemed slow. So we started logging garbage collection, and found that we were collecting garbage at a rate of 20 GB/min!

Profiling memory allocation revealed that the worst offender, by far, was OracleStatement.prepareAccessors(). Interestingly, it only caused a problem when our result set included a LOB. For such queries, it allocated a 1 MB object, regardless of whether the query returned any results at all.

Google searches revealed others who saw similar problems when accessing LOBs, but no solutions other than upgrading or changing drivers. We were already using the latest Oracle JDBC driver, and reverting to earlier versions did not help. Switching to a different driver did solve the problem; however, pushing that change to production would require extensive testing to ensure that we were not trading one problem for another (or several).

I was about to start conducting these tests when John discovered that we were setting the OracleConnection parameter “defaultRowPrefetch” to 1000. This parameter determines how many rows are pulled back from the database on each round-trip, and increasing this value from its default of 10 will normally yield a performance gain. As an experiment, I set the value to 1, and re-profiled memory allocation. The amount of memory allocated by OracleStatement.prepareAccessors() decreased by about three orders of magnitude. Thus, it appears that when a query can return a LOB, Oracle’s JDBC driver allocates approximately “rowPrefetch” KB of memory, even when zero rows are returned.
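As a rough illustration of the setting discussed above, the sketch below builds a set of connection properties carrying "defaultRowPrefetch" and estimates the per-statement allocation using the roughly 1 KB per prefetched row relationship observed in this post. The class and method names are hypothetical, the 1 KB figure is this post's empirical observation rather than a documented driver guarantee, and the JDBC URL in the comment is a placeholder.

```java
import java.util.Properties;

// Hypothetical helper; not part of the Oracle driver or of our code base.
final class PrefetchConfig {

    // Builds properties that could be handed to
    // DriverManager.getConnection(url, props), for example with a
    // placeholder URL such as "jdbc:oracle:thin:@//dbhost:1521/service".
    static Properties connectionProperties(String user, String password,
                                           int defaultRowPrefetch) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Rows pulled back per round trip; the driver default is 10.
        props.setProperty("defaultRowPrefetch",
                          Integer.toString(defaultRowPrefetch));
        return props;
    }

    // Assumption taken from the observation above: a statement that can
    // return a LOB allocates about 1 KB per prefetched row, even when it
    // returns no rows at all.
    static long estimatedLobBufferBytes(int rowPrefetch) {
        return rowPrefetch * 1024L;
    }
}
```

Under that assumption, a prefetch of 1000 works out to about 1 MB per LOB-capable statement, which matches the allocation size seen in the profiler, while a prefetch of 1 works out to about 1 KB.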

The Solution

Returning the “defaultRowPrefetch” parameter to 10 did rid us of our garbage collection problems. However, because this is a global setting, reverting it reduced the performance of many other queries which returned many rows with no LOBs. The prospect of setting “rowPrefetch” on a per-query basis was unappetizing, to say the least, but the performance loss was significant. In the end, we altered how we retrieve rows from the database so that the fetch size geometrically increases as we pull results back from the database.

Specifically, the first batch we retrieve contains at most 10 rows, after which we increase the fetch size to 20. Once we’ve retrieved 20 more rows, we increase the fetch size to 40, and so on. In this way, we never allocate large amounts of memory for queries which return few (or no) results, but we still quickly ramp up to a large fetch size.
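The doubling scheme described above can be sketched with the standard JDBC fetch-size hint. This is a minimal illustration and not our production code: the class name, the helper methods, and the cap of 1000 rows are assumptions made for the example, and ResultSet.setFetchSize is only a hint that a driver may treat differently.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a geometrically increasing fetch size.
final class GeometricFetch {
    static final int INITIAL_FETCH_SIZE = 10;  // first batch size from the post
    static final int MAX_FETCH_SIZE = 1000;    // assumed cap, not from the post

    // Doubles the current fetch size, never exceeding max.
    static int nextFetchSize(int current, int max) {
        return (int) Math.min((long) current * 2, max);
    }

    // Fetch sizes that would be requested, in order, until at least
    // totalRows rows are covered. Always includes the first request.
    static List<Integer> schedule(int totalRows, int initial, int max) {
        List<Integer> batches = new ArrayList<>();
        int fetchSize = initial;
        int remaining = totalRows;
        do {
            batches.add(fetchSize);
            remaining -= fetchSize;
            fetchSize = nextFetchSize(fetchSize, max);
        } while (remaining > 0);
        return batches;
    }

    // Walks a result set, raising the fetch-size hint after each full batch.
    // The size of the very first batch is normally governed by
    // Statement.setFetchSize (or the connection default) before the query
    // is executed, so the caller should set that to INITIAL_FETCH_SIZE.
    static int readAll(ResultSet rs) throws SQLException {
        int fetchSize = INITIAL_FETCH_SIZE;
        int rowsInBatch = 0;
        int total = 0;
        rs.setFetchSize(fetchSize);
        while (rs.next()) {
            total++;  // a real caller would read the row's columns here
            if (++rowsInBatch == fetchSize) {
                fetchSize = nextFetchSize(fetchSize, MAX_FETCH_SIZE);
                rs.setFetchSize(fetchSize);
                rowsInBatch = 0;
            }
        }
        return total;
    }
}
```

With these assumed values, a query returning 25 rows requests batches of 10 and then 20, while a query returning no rows only ever requests the initial batch of 10. The cap keeps a very large result set from growing the fetch size without bound.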

For large queries which returned no LOBs, this solution is still slower than when “defaultRowPrefetch” was 1000. However, the slowdown on those queries was minor, overall system performance was substantially improved, and, importantly, the changes did not require any per-query tuning.

