Posts: 370 | Thanked: 443 times | Joined on Jan 2006 @ Italy
#11
Originally Posted by hawaii View Post
Moving hildon-sv-notification-daemon out of [mediasrc] closes the socket and doesn't allow any sound?
Why should it? As far as I understand, changing syspart.conf just changes resource utilization on a per-process basis, and that's the whole reason for ohmd's presence. So it should simply lower their priority.
What I can say is that on my machine, in its current state, the notification balloons are now delayed (even by 5 or 10 seconds) while chatting - I don't know about emails - but I hear both the vibration and the notification sounds.
 

The Following 2 Users Say Thank You to jurop88 For This Useful Post:
Dark_Angel85 | Posts: 519 | Thanked: 123 times | Joined on Oct 2010 @ Malaysia
#12
If your configuration really makes the N900 snappier and more responsive without sacrificing anything else, it should really be in a wiki or included via Swappolube or something...

This is just great work that you've done. Marvellous!
 
Posts: 1,258 | Thanked: 672 times | Joined on Mar 2009
#13
As for the kernel reporting the mmcblk blocksize as "512k": it's not. It's saying the logical block size is 512 bytes. This is meaningless for your purposes though; it only tells you the smallest request size the MMC will accept. Internally it then translates a 512-byte write into a read-modify-erase-write cycle of 128k or 256k, whatever its true block size is.
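For anyone who wants to see what the kernel actually reports, the value lives in sysfs (standard block-layer attributes; mmcblk0 is just an example device name - the eMMC on a stock N900, with the uSD usually showing up as mmcblk1 - and older kernels may only have hw_sector_size):
Code:
# reported logical block size in bytes - this is the "512" in question
cat /sys/block/mmcblk0/queue/hw_sector_size
# newer kernels also expose the same value as:
cat /sys/block/mmcblk0/queue/logical_block_size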

This brings us to the "noop" scheduler issue. You are correct that there are no moving parts, but the huge block size calls for scheduling writes close to each other anyway, to minimize the number of read-modify-erase-write cycles the eMMC/uSD has to do.

Imagine the kernel sends a request to write 4k at position 2M, then 4k at position 8M, then 4k at 2M+4k, 4k at 8M+4k, and so on. Each request makes the uSD/eMMC internally read 128k (assuming that's the true erase block size), change 4k of that 128k, erase another 128k block, and write 128k to that block. That's a write amplification factor of 32. Divide the nominal raw write rate of 6 MB/s for Class 6 by 32 and you get an estimated 192 kilobytes/sec...
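A quick sanity check of that estimate with plain shell arithmetic (the 6 MB/s, 128k and 4k figures are the nominal/assumed values from above):
Code:
raw=$((6 * 1024))     # nominal Class 6 write rate, in KB/s
amp=$((128 / 4))      # one 128k erase block rewritten per 4k random write
echo $((raw / amp))   # -> 192 KB/s effective random-write rate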
So ideally we'd want an elevator that knows about the special properties of flash, but we don't have one, so we use CFQ, which at least has some heuristics for distributing IO "fairly" between processes.


Incidentally, this is also where the explanation of why moving swap to uSD seems to improve performance begins.

The heaviest loads for the eMMC are swap and anything that uses databases like SQLite. That includes the dialer and Conversations, the calendar, and many third-party apps. Why is this a heavy load? Because these things typically write tiny amounts of data and then request fsync() to ensure the data is on disk. This triggers the writeout of all unwritten data in memory, plus an update of all the filesystem structures. Remember that a tiny amount of data spread out randomly triggers a massive amount of writing internally on the eMMC. Worse, while this goes on, all other requests are blocked.
And what else besides /home and swap is on the eMMC? /opt. Containing, these days, both apps and vital parts of the OS. The CPU is starved for data, waiting for pending writes to complete so that the reads for the demand-paged executable code of apps can be served.
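If you want to watch this happen, /proc/diskstats needs no extra packages; fields 6 and 10 are sectors read and written since boot, so sampling it before and after some phone use gives a rough idea of how hard each card is being hit (the mmcblk0/mmcblk1 naming is the usual N900 layout - check yours):
Code:
# sectors read (field 6) and written (field 10) since boot, per whole device
awk '$3 ~ /^mmcblk[01]$/ { print $3, "read:", $6, "written:", $10 }' /proc/diskstats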

Btw, for Harmattan I'm told SQLite will be using a more optimized DB engine, one that essentially works like one gigantic journal. Sequential writing is fast and good on flash; random in-place updates are bad.

Moving swap to uSD gives swap a path that is almost always free (unless you do heavy accesses to the uSD by other means), and offloading swap from the eMMC means less random IO load on the eMMC.
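For anyone wanting to try it, the recipe looks roughly like this - the partition names are only an example (mmcblk0p3 is the stock eMMC swap, mmcblk1p3 stands for a spare partition you created on the uSD, and whatever was on it is lost), and the -p priority flag assumes a swapon that supports priorities:
Code:
mkswap /dev/mmcblk1p3         # format the spare uSD partition as swap
swapon -p 10 /dev/mmcblk1p3   # enable it with higher priority
swapoff /dev/mmcblk0p3        # drop the eMMC swap...
# ...or keep it as a lower-priority fallback instead:
# swapon -p 5 /dev/mmcblk0p3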
 

The Following 7 Users Say Thank You to shadowjk For This Useful Post:
Posts: 92 | Thanked: 95 times | Joined on Feb 2010 @ Smyrna, Atlanta / Bangalore, India
#14
@jurop88

Lots of respect and thanks... that's fantastic, and you've put in a mind-blowing amount of effort.

It took me three reads just to understand the things you have tried out.

Very impressive... I hope you do some more R&D so we can make the N900 even better.

Thanks
 
Posts: 370 | Thanked: 443 times | Joined on Jan 2006 @ Italy
#15
Hi Shadowjk,

thank you for your participation.

Originally Posted by shadowjk View Post
As for the kernel reporting the mmcblk blocksize as "512k": it's not. It's saying the logical block size is 512 bytes. This is meaningless for your purposes though; it only tells you the smallest request size the MMC will accept. Internally it then translates a 512-byte write into a read-modify-erase-write cycle of 128k or 256k, whatever its true block size is.
Fair enough, and rather consistent with what I found on the internet. Two questions:
1) Why does 512k mean 512 bytes? Can you point me somewhere, maybe in the kernel source? I just started digging into the matter and found relevant code in the MMC driver (I hope I'm on the right path to understanding something), but I must admit my C knowledge is rather rusty.
2) Where do I find the true HW block size? Is it reported anywhere, or do I have to get it directly from the uSD manufacturer?
The 128k size, though, explains why the Nokians chose to set page-cluster to 5: 2^5 = 32 pages of 4 KB = 128 KB, and that's it.
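For question 2, newer mainline kernels export the driver's guess of the erase geometry in sysfs; no guarantee the Maemo kernel has the attribute, in which case the card datasheet is the only source. The last line just redoes the page-cluster arithmetic:
Code:
# preferred erase size as reported by the mmc driver (newer kernels only)
cat /sys/block/mmcblk1/device/preferred_erase_size
# current page-cluster value
cat /proc/sys/vm/page-cluster
# 2^5 pages * 4 KB per page = 128 KB per swap-out cluster
echo $(( (1 << 5) * 4 ))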

Originally Posted by shadowjk View Post
This brings us to the "noop" scheduler issue. You are correct that there are no moving parts, but the huge block size calls for scheduling writes close to each other anyway, to minimize the number of read-modify-erase-write cycles the eMMC/uSD has to do. Imagine... (CUT)
From Wikipedia,
The NOOP scheduler inserts all incoming I/O requests into a simple, unordered FIFO queue and implements request merging
That means, AFAIU, that when a block is ready to be written (after request merging), it is written and the memory is freed.
Wikipedia again,
CFQ works by placing synchronous requests submitted by processes into a number of per-process queues and then allocating timeslices for each of the queues to access the disk. The length of the time slice and the number of requests a queue is allowed to submit depends on the IO priority of the given process (...) It can be considered a natural extension of granting IO time slices to a process
So it doesn't work at the level of 'try to write as few blocks as possible to the uSD'; the goal is to give every process a time slice, 'hoping' that most writing and reading will be done in the same area. I gave the code a quick read, and it looks like the 'elevator' part carries a huge weight, allowing some backtracking (I am not an expert in this area, so take everything with a grain of salt). The overhead is rather considerable, and at first sight brings almost no advantage on an IO device where no mechanical parts are moving.
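For reference, checking and flipping the scheduler at runtime is just a sysfs write (standard block-layer interface, not persistent across reboots; mmcblk0 = eMMC, mmcblk1 = uSD on a stock N900):
Code:
cat /sys/block/mmcblk0/queue/scheduler            # the bracketed entry is the active one
echo noop > /sys/block/mmcblk0/queue/scheduler    # try noop
echo cfq > /sys/block/mmcblk0/queue/scheduler     # and back to cfq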
After having used the settings from the first page for some days, I have to say that with NOOP the fragmentation is probably bigger, but the feeling is that it works faster UNTIL IT WORKS. Another forum member (I don't remember precisely who) set up a swap rotation during the night to avoid this fragmentation, and I can confirm that after two days my N900 started 'choking' and a swapon/swapoff/swapon/swapoff cycle let it fly again, consistent with the issue being swap fragmentation.
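The 'rotation' is nothing fancy - a nightly job along these lines should do, assuming a second swap area is available to absorb the pages while the first is off (device names are just an example, and everything currently in swap must fit in RAM plus the other swap area for the swapoff to succeed):
Code:
#!/bin/sh
# push everything off the fragmented swap, then re-enable it freshly laid out
swapon /dev/mmcblk0p3     # make sure the fallback swap is available
swapoff /dev/mmcblk1p3    # forces its pages into RAM / the other swap area
swapon /dev/mmcblk1p3     # re-add it, unfragmented again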

Originally Posted by shadowjk View Post
So ideally we'd want an elevator that knows about the special properties of flash, but we don't have one, so we use CFQ, which at least has some heuristics for distributing IO "fairly" between processes.
The argument is that we don't care about 'per process' I/O; we care about writing in exact 128 KB chunks, to speed them up as much as possible.
What we ideally need is a scheduler that says:
Code:
- kernel: we need some free room.
- scheduler: OK, let's have a look at the discardable pages. Here they are. Just a second, please.
- scheduler picks exactly 128 KB ready for writing (and that's the page-cluster tunable at kernel level, right?)
- scheduler frees the requested memory with a single page write
- scheduler: here I am again, the memory you requested is free
- kernel: thank you
The fact that lots of pages then end up fragmented does not matter, since the read penalty is very low compared to, for example, an HD.
I have already found an example of a NOOP scheduler written in C on the internet, and it does not look too hard to implement. We are talking brute force here, not high math. A simple modified NOOP algorithm suited to flash could look like:
Code:
- check if the page to be unloaded is already cached and not dirty, or already in the current queue
  if yes -> load the requested page and discard the unloaded one
  if no -> put the page to be unloaded in the queue and serve the page to be loaded
- is the queue 128 KB?
  if yes -> write it out and update the table of swapped pages
  if no -> job done
I know that the real writing will be performed by the uSD HW controller, but why on earth would the HW controller split a perfectly aligned 128 KB write? Your thoughts? Any kernel gurus in the neighbourhood? Am I missing something? It looks too simple for nobody to have thought about it...
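A crude way to get a feeling for the difference is to time large buffered writes against small synchronous ones on the card (the target path is just an example of a uSD mount point, conv=fsync/oflag=sync need a full GNU dd rather than the busybox one, and the numbers are only indicative because of the card's internal caching):
Code:
# 32 MB as big, nicely aligned buffered writes, synced once at the end
dd if=/dev/zero of=/media/mmc1/ddtest bs=128k count=256 conv=fsync
# the same 32 MB as 4k writes, synced block by block - expect this to crawl
dd if=/dev/zero of=/media/mmc1/ddtest bs=4k count=8192 oflag=sync
rm /media/mmc1/ddtest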

Originally Posted by shadowjk View Post
Moving swap to uSD gives swap a path that is almost always free (unless you do heavy accesses to the uSD by other means), and offloading swap from the eMMC means less random IO load on the eMMC.
This sounds very reasonable and is consistent with other findings.

On a side note, I am digging into the ohmd & cgroups realm and I am happy to have learnt a lot of things - probably the parameters on the first page will be tuned again after some days of usage and after looking at the patterns that have emerged in terms of load and memory use.
EDIT - oh, and I forgot to report this https://bugs.maemo.org/show_bug.cgi?id=6203 where many hints on ohmd & syspart are given!

Last edited by jurop88; 2011-03-20 at 15:23. Reason: final addition
 

The Following 2 Users Say Thank You to jurop88 For This Useful Post:
Posts: 370 | Thanked: 443 times | Joined on Jan 2006 @ Italy
#16
Hehe, it looks like I got a bit confused between kswapd and the IO scheduler - still learning a lot during this period of illness.
 

The Following 3 Users Say Thank You to jurop88 For This Useful Post:
Posts: 1,397 | Thanked: 2,126 times | Joined on Nov 2009 @ Dublin, Ireland
#17
Hi jurop88,

I've spent at least 20 minutes trying to find this thread again, as I'm doing some experiments with information that is split across multiple threads:
  1. Swappluble Wiki
  2. Massive interactivity improvement under high I/O load!
  3. Striping swap to increase performance under memory contention
  4. Nokia N900 Smartphone Performance Optimization Tune-up Utilities
  5. Swappolube to lubricate your gui

And this one

Have you made any more progress?
 

The Following User Says Thank You to ivgalvez For This Useful Post:
Posts: 370 | Thanked: 443 times | Joined on Jan 2006 @ Italy
#18
Originally Posted by ivgalvez View Post
Hi jurop88,

I've spent at least 20 minutes trying to find this thread again, as I'm doing some experiments with information that is split across multiple threads:
(...)
Have you made any more progress?
It was pretty much the same goal on my side. Now I'm back at my job so the pace has slowed, but what I can say is that Nokia's engineers already did a lot of work on the subject and the phone was probably already well optimized for the general use case.
Since I wrote the original post I have made some slight modifications, but they are not yet updated here. Perhaps I will do it over the weekend.
 

The Following 2 Users Say Thank You to jurop88 For This Useful Post:
Banned | Posts: 358 | Thanked: 160 times | Joined on Dec 2010
#19
Originally Posted by ivgalvez View Post
Hi jurop88,

I've spent at least 20 minutes trying to find this thread again, as I'm doing some experiments with information that is split across multiple threads:
  1. Swappluble Wiki
  2. Massive interactivity improvement under high I/O load!
  3. Striping swap to increase performance under memory contention
  4. Nokia N900 Smartphone Performance Optimization Tune-up Utilities
  5. Swappolube to lubricate your gui

And this one


Have you made any more progress?
You'll want to look at the BFS kernel thread, the mlocker thread (see my signature) and the 4-line cgroup patch, too!

Last edited by epitaph; 2011-03-25 at 09:24.
 
Banned | Posts: 358 | Thanked: 160 times | Joined on Dec 2010
#20
> partition desktop memory-limit 70M

With cgroups mounted, I noticed that the desktop group only needs 25M.

So it's better to write: partition desktop memory-limit 25M

or: echo "25M" > /dev/cgroup/cpu/desktop/memory.limit_in_bytes
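Before lowering the limit it may be worth checking what the group actually peaks at - these are standard cgroup memory-controller files, using the same hierarchy as above (if only the cpu controller is mounted at that path, they won't be there):
Code:
cat /dev/cgroup/cpu/desktop/memory.usage_in_bytes       # current usage
cat /dev/cgroup/cpu/desktop/memory.max_usage_in_bytes   # high-water mark since boot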
 