Running a large number of "counting queries" on a NoSQL data store: algorithm or best practice needed!

Tags: Google App Engine, Java applications, Java development
Hello everyone,
Before we start, let me give you some information about our environment:
* It is written fully in Java/J2EE.
* It is developed to be deployed on GAE (Google App Engine).
* Its GUI is developed with GWT.
* Our problem is a core development issue.
Here is my problem:
* I am building a web application where online users can search for listings on the site.
* First, open the website and search for any keyword, e.g. "Accounting".
* A results page opens; a [Narrow Search] panel lets you reach your target job more easily (let's call this the filter), with the matching jobs listed below.
* The search filter includes the sub-filters [Category, Company, City, State].
* Each sub-filter has many options; for example, State has (California, Iowa, Kansas, etc.). Beside each option is the number of jobs that match your current filter/sub-filter selection.
Now we want to provide this filter functionality, and we want to make it fast.
Running a separate count query for each sub-filter option is not going to be an efficient approach.
Kindly keep in mind that:
* Users can add/remove listings.
* Listings can also expire.
* The number of sub-filters is high for us (it can reach 20).
* Each sub-filter has between 2 and 200 options.
We are looking for a best practice, a suggested algorithm, or any other approach to solve this problem.
Here are the 2 options we have reached so far:
1) Build a statistics table to hold these results, update it each time the number of listings changes, and run a nightly background job to recalculate the results. The numbers shown in the filter can then be read directly from this table.
2) Build a tree data structure that is loaded into memory and persisted to a table each time it is updated. This tree contains the resulting number of listings for each sub-filter option.
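The first option can be sketched as incrementally maintained per-option counters: instead of re-running count queries on every search, each (sub-filter, option) pair keeps a running count that is adjusted when a listing is added, removed, or expires. This is a minimal in-memory sketch under those assumptions; the class and method names are hypothetical, and in a real deployment the map would be backed by the statistics table:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of option 1: one counter per (sub-filter, option)
// pair, adjusted on every listing change instead of recounted per search.
public class FacetCounts {
    // Key format: "subFilter:option", e.g. "State:California".
    private final Map<String, Integer> counts = new HashMap<>();

    // A listing is modeled as a map from sub-filter name to its option.
    public void onListingAdded(Map<String, String> listing) {
        for (Map.Entry<String, String> e : listing.entrySet()) {
            counts.merge(e.getKey() + ":" + e.getValue(), 1, Integer::sum);
        }
    }

    // Called on removal or expiry; decrements each affected counter.
    public void onListingRemoved(Map<String, String> listing) {
        for (Map.Entry<String, String> e : listing.entrySet()) {
            counts.merge(e.getKey() + ":" + e.getValue(), -1, Integer::sum);
        }
    }

    // The number shown beside a filter option, read without any count query.
    public int count(String subFilter, String option) {
        return counts.getOrDefault(subFilter + ":" + option, 0);
    }
}
```

With 20 sub-filters of up to 200 options each, this is at most a few thousand counters, so the nightly recalculation job mentioned in option 1 stays cheap and only corrects drift.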
Even so, I still think this is not enough!
Can anyone suggest a better idea?
All comments, questions, and suggestions are very welcome.
Mohammad S.

Software/Hardware used:
Google App Engine, J2EE, Java, Performance


Discuss This Question: 5  Replies

  • Mshafie
    This question is mainly aimed at people interested in: high-performance computing, cloud computing, algorithms, best practices, and data structures. I hope someone can help. Any questions or suggestions are welcome! Regards, Mohammad S.
  • carlosdl
    Part of the question title says "Running a large number of counting queries on a NoSQL data store," but from the problem description I understand that you are in fact using a database, and that you expect executing count queries every time to be too slow. Is my understanding correct? If so, what database are you using? This is important, because different methods could be considered depending on the answer. For example, on Oracle a materialized view could be a good option (which corresponds to the first option you mentioned); on MySQL, a MEMORY (heap) table could be considered. You said "...even though I still think this is not enough." Have you tested these approaches? Is there a reason to think they won't be enough?
  • Mshafie
    Thanks, Carlosdl. I am afraid it is not a relational database; it is the Google App Engine datastore. The counting is done to produce the numbers in brackets in the "narrow search" filter. These numbers are real-time data that change whenever the data itself changes. Kindly visit the site and check its narrow filter for more details, or tell me if I should explain the usage further. I have not tested it yet, but on a cloud computing platform it will cost more if it needs more CPU and other resources. I want to find the optimum design or best practice here. Regards, Mohammad
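    For reference, the GAE documentation describes a sharded-counter pattern for exactly this kind of frequently updated count: because writes to a single datastore entity are rate-limited, each counter is split into N shards, an increment goes to a randomly chosen shard, and a read sums all shards. This is a minimal sketch of the idea with a plain array standing in for the datastore entities (the class name is hypothetical):

```java
import java.util.Random;

// In-memory sketch of the sharded-counter pattern: writes hit a random
// shard so concurrent updates don't contend on one entity; reads sum
// all shards. On GAE each shard would be a separate datastore entity.
public class ShardedCounter {
    private final int[] shards;
    private final Random random = new Random();

    public ShardedCounter(int numShards) {
        this.shards = new int[numShards];
    }

    // Increment one randomly chosen shard (one entity write on GAE).
    public void increment() {
        shards[random.nextInt(shards.length)]++;
    }

    // The true count is the sum over all shards.
    public int getCount() {
        int total = 0;
        for (int shard : shards) {
            total += shard;
        }
        return total;
    }
}
```

    On GAE the summed value is usually cached (e.g. in memcache) so that reads don't fetch every shard each time; a slightly stale count beside a filter option is normally acceptable.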
  • carlosdl
    Well, I didn't know how the Google App Engine worked, but I have read a little and now have a better understanding of the situation. This is just a thought, as I have never developed for GAE and don't know its specific best practices, nor the inner details of how the GAE datastore works. For better response time, I think your second option (an in-memory tree structure) would be the best, unless GAE offers some caching capability that makes queries faster when data changes are infrequent, in which case your best option would probably be the first one (the statistics table). Using an in-memory structure will probably be faster (depending on the type of structure and the search algorithms used), but it consumes more memory and adds complexity to the application code. A background job would also be needed to handle updates for expiring items. I would tend to prefer this option, but I really think both approaches should be tested thoroughly before making a final decision.
  • Mshafie
    Thanks, Carlosdl. The application runs on the cloud (GAE), so I think increasing memory usage by a few more KB to hold that data structure will not be an issue. But I agree with you that the code will be more complex and the background jobs harder to maintain. Maybe that will push us toward solution (1), possibly with some way of persisting that data structure in the GAE datastore, but that would still make the code highly complex. Again, thanks Carlosdl. Regards, Mohammad
