Child pages
  • Mapping Information
Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 21 Current »

Summary

We generate five datasets, each of which is ingested into a scratch database where they can be combined via SQL. The datasets are:

Nr.NameSourceColumnsSizeComments
1bta_transcodesbta databaseprogram_guid, walltime360262Walltimes calculated directly from data collected during transcoding (since 2012).
2all_programsdigitv databasedigitv_id, ritzau_id, starttime866272

Restricted to relevant radio channels only, but otherwise including all programs, recorded or not.

3all_transcodes_with_shardsDOMS risearchprogram_guid, digitv_id, shard_guid1073244Transcoded programs from 2012 and before. Includes both radio and TV.
4all_transcodes_without_shardsDOMS risearchprogram_guid, digitv_id3303413Contains every DOMS program with Ritzau metadata. (So it is misnamed.)
5mediestreamMediestream SOLR indexprogram_gui666292Dump of every radio program in the current mediestream index.

Background

We need to map old objects in MCM, which are identified by either ritzau-original id's or shard id's, to DOMS Program GUIDs, and find the walltime for the start of every file.

In the bta database:

bta-prod=> select distinct(channel) from broadcasttranscodingrecord where video = false and transcodingcommand is not null; channel  
----------
 drp1
 drp3
 drp4o
 drp5
 voice
 drp8j
 drp7m
 100fm
 drp4k
 drp6b
 novafm
 drp2
 drp4s
 drrar
 drmamara
(15 rows)
bta-prod=> select distinct(channel) from broadcasttranscodingrecord where video = true and transcodingcommand is not null;
     channel      
------------------
 folketinget
 dr2
 dk4
 tv2s
 kanal4
 kanaloestjylland
 drk
 tv2z
 drhd
 tv2d
 dr1
 drup
 tv2c
 tv2news
 tv3p
 drram
 voicetv
 sbsnet
 tv3
 kanal5
(20 rows)

So the video column correctly identifies tv and radio channels. Also

bta-prod=> select distinct(startoffset) from broadcasttranscodingrecord where video = false and transcodingcommand is not null;
 startoffset 
-------------
         -30
(1 row)

So all radio programs have a -30 second startoffset.

According to Henning's old LARM exporter, all other (older) files/programs have a -20 second offset.

We can find all channels in mediestream with a query like 

http://prod-search03:57308/doms/sbsolr/collection1/select?q=*%3A*&fq=recordBase%3Adoms_radioTVCollection&wt=json&indent=true&facet=true&facet.field=channel_name&facet.limit=100000

BTA_TRANSCODES

In BTA we record the start-time offset in the database so we can get walltime out of the bta database with a command like

bta-prod=> select id,broadcaststarttime,startoffset,to_char(to_timestamp(extract(epoch from broadcaststarttime)+startoffset),'YYYY-MM-DD HH24:MI:SS') AS walltime_start from broadcasttranscodingrecord where id='uuid:5ae2177a-537c-4d1e-84d8-27a62e8f8d8b';

Or putting it together

bta-prod=> select id,to_char(to_timestamp(extract(epoch from broadcaststarttime)+startoffset),'YYYY-MM-DD HH24:MI:SS') AS walltime_start from broadcasttranscodingrecord where video = false and transcodingcommand is not null;

For csv output:

psql -h phoebe -U bta-ro -d bta-prod -t -A -F"," -W -c "select id,to_char(to_timestamp(extract(epoch from broadcaststarttime)+startoffset),'YYYY-MM-DD HH24:MI:SS') AS walltime_start from broadcasttranscodingrecord where video = false and transcodingcommand is not null" > walltime.csv

ALL_TRANSCODES_WITH_SHARDS

We can also obtain a list of mappings between program-guid, digitv-id and shard-guid with an appropriate sparql query:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
select $obj $id $shard
where  {
?obj <info:fedora/fedora-system:def/model#hasModel> <info:fedora/doms:ContentModel_Program>.
?obj dc:identifier ?id .FILTER regex(str(?id), 'Ritzau').
?obj <http://doms.statsbiblioteket.dk/relations/default/0/1/#hasShard> ?shard.
}

I use the risearch GUI in firefox with firebug to translate this to a uriencoded  DOMS query:

curl --user fedoraReadOnlyAdmin:<pwd>  'http://naiad:7880/fedora/risearch?type=tuples&lang=sparql&format=CSV&dt=on&stream=on&query=PREFIX+dc%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%0D%0Aselect+%24obj+%24id+%24shard%0D%0Awhere++%7B%0D%0A%3Fobj+%3Cinfo%3Afedora%2Ffedora-system%3Adef%2Fmodel%23hasModel%3E+%3Cinfo%3Afedora%2Fdoms%3AContentModel_Program%3E.%0D%0A%3Fobj+dc%3Aidentifier+%3Fid+.FILTER+regex%28str%28%3Fid%29%2C+%27Ritzau%27%29.%0D%0A%3Fobj+%3Chttp%3A%2F%2Fdoms.statsbiblioteket.dk%2Frelations%2Fdefault%2F0%2F1%2F%23hasShard%3E+%3Fshard.%0D%0A%7D' > all_transcodes_with_shards.csv

and then use sed to filter out extraneous material.

ALL_TRANSCODES_WITHOUT_SHARDS

Similarly find all the transcodes without shards (ie all programs in DOMS whether or not they have shards) :

PREFIX dc: <http://purl.org/dc/elements/1.1/>
select $obj $id $shard
where  {
?obj <info:fedora/fedora-system:def/model#hasModel> <info:fedora/doms:ContentModel_Program>.
?obj dc:identifier ?id .FILTER regex(str(?id), 'Ritzau').
}

in all_transcodes_without_shards.csv. 

ALL_PROGRAMS

This is a list of all digitv-id's, ritzau-original-id's and scheduled start-times. This is just a dump from the digitv database.

 psql -h digitv-prod -d digitaltv -U digitvro -t -A -F"," -W -c "select id, program_id, starttid from ritzauprogram WHERE channel_name IN ('drp1','drp2','drp3','drp4o','drp4k','voice','drp8j','drp7j','100fm','drp6b','novafm','drp4s','drrar','drmamara', '24syv', 'drp5', 'drpk', 'drp7m', 'tv2radio', 'rakl', 'drp5000', 'radio2', 'boomerang', 'drp4k94', 'drp4re', 'drp4mv', 'drp4b', 'drp4nj', 'drp4f', 'drp4fyn', 'drp4miv', 'drp4nor', 'drp4blm')" > all_programs.csv

MEDIESTREAM

We are only interested in programs which are actually in mediestream. These can be found with

csr@pc609:~/larm_doms_mapping$ wget -O - 'http://prod-search03:57308/doms/sbsolr/collection1/select?q=lma_long%3Aradio&rows=1000000&fl=recordID&wt=csv&indent=true'|sed 's!doms_radioTVCollection:!!' > mediestream.csv

There are a handful of programs that are in the mediestream index even though their transcoding failed. To find these, find all failed bta transcodings: 

[bta@iapetus ~]$ psql -h phoebe -U bta-ro -d bta-prod -t -A -F"," -W -c "select id from broadcasttranscodingrecord where transcodingstate=2" > btafails.csv

and use these to filter the mediestream list 

csr@pc609:~/larm_doms_mapping$ diff --new-line-format="" --unchanged-line-format="" <(sort mediestream.csv ) <(sort btafails.csv ) > mediestream.filter.csv

ALTERNATIVE MEDIESTREAM

What if we want all transcoded radio files, whether or not they are in mediestream? This was made slightly complex by the existence of doms-link→-shard-file links, but is now much easier because during the migration to isilon we reversed the links. So

  1. Files 

    [develro@phoebe ~]$ find /radio-tv -type f -name '*.mp3' > radiofiles.txt
    csr@pc609:~/larm_doms_mapping$ cat radiofiles.txt |grep -o '[-0-9a-f]\+.mp3'|sed -s 's/.mp3//'|xargs -n 1 echo uuid:|sed -s 's/ //' > radiofiles_uuid.txt
  2. DB 

    scratch=# create table radiofiles (radiofile_guid varchar(100) primary key);
    copy radiofiles from '/home/csr/larm_doms_mapping/radiofiles_uuid.txt' CSV;
    

POST-PROCESSING

To process the data, read it into a scratch database: 

create table bta_transcodes (program_guid varchar(100) primary key, walltime timestamp); 
copy bta_transcodes from '/home/csr/larm_doms_mapping/bta_walltime.sorted.csv' CSV; 
create table all_programs (digitv_id varchar(100) primary key, ritzau_id varchar(100), starttime timestamp); 
copy all_programs from '/home/csr/larm_doms_mapping/all_programs.csv' CSV; 
create table all_transcodes_with_shards(program_guid varchar(100) primary key, digitv_id varchar(100), shard_gui varchar(100)); 
copy all_transcodes_with_shards from '/home/csr/larm_doms_mapping/all_transcodes_with_shards.csv' CSV; 
create table all_transcodes_without_shards(program_guid varchar(100) primary key, digitv_id varchar(100)); 
copy all_transcodes_without_shards from '/home/csr/larm_doms_mapping/all_transcodes_without_shards.csv' CSV; 
create table mediestream(program_guid varchar(100)); 
copy mediestream from '/home/csr/larm_doms_mapping/mediestream.filter.csv' CSV;

First we combine the two DOMS queries, to create a single view with nulls for those programs that don't have a shard-guid:

scratch=# create view all_transcodes as select * from mediestream natural left outer join (all_transcodes_without_shard natural left outer join all_transcodes_with_shard);

The left join with mediestream ensures that we get only those entries which are actually available in mediestream, and not all programs in DOMS.

Then export the results as follows

psql -d scratch -t -A -F"," -c "select program_guid,ritzau_id,shard_guid,walltime from all_transcodes NATURAL JOIN bta_transcodes NATURAL JOIN all_programs;" > bta_transcodings.csv 
psql -d scratch -t -A -F"," -c "select program_guid,ritzau_id,shard_guid,starttime-interval '20 seconds' from all_transcodes natural join all_programs where not exists (select null from bta_transcodes where bta_transcodes.program_guid=all_transcodes.program_guid);" > prebta_transcodings.csv

The first query is just a dump of bta_transcodes enriched with ritzau_id and shard_guid information.

The second query means "take all programs that are transcoded, but not with bta, and assume their timestamp is starttime - 20 seconds".

  • No labels