In the previous video, we explored loading data from a 10x Genomics matrix file.
In this video, we'll show you how to load the data from a spreadsheet file.
We'll demonstrate this with a recent paper on gene expression
in the cells of a social amoeba called Dictyostelium.
In the supplemental information of the paper
there is a link to an Excel file
with read counts and analysis.
The file has multiple worksheets,
with the first one containing the count data
and the second one containing normalized data.
We'll work with the normalized data.
To avoid confusion, let's first isolate it
by copying it into a new file with a single worksheet.
We'll save this new file to the desktop.
Now it's time to load the data into Orange.
We'll first need the Load Data widget from the Single Cell widget group.
Since the widget remembered the file list from our previous video,
we'll first remove this list
and then drag our data file from the desktop to the widget.
This data includes 81 cells
and over 11,000 genes.
We should make sure that Orange knows there is one header row
and that the first column contains labels.
Now we're ready to load the data.
Let's first check the data in the data table.
It's nothing special.
Lots of zeros, like any single-cell data set.
Next, we will filter out the genes that are expressed in only a few cells.
We'll use the Filter widget.
Filter on genes and detection count
and set the lower bound to 30 cells.
There are about 8,000 genes that are expressed in at least 30 cells.
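For those who want to see the idea behind this filter outside of Orange, here is a minimal sketch in plain Python. It assumes that a gene's detection count is the number of cells in which its value is greater than zero; the toy matrix, the gene names, and the helper functions are made up for illustration and are not part of Orange.

```python
# Toy expression matrix: one row per cell, one column per gene (made-up values).
genes = ["geneA", "geneB", "geneC"]
counts = [
    [0.0, 2.1, 0.0],
    [1.3, 0.4, 0.0],
    [0.0, 3.0, 0.0],
    [2.2, 1.1, 0.7],
]


def detection_counts(matrix):
    """Return, for each gene (column), the number of cells with a value above zero."""
    return [sum(1 for value in column if value > 0) for column in zip(*matrix)]


def filter_genes(matrix, gene_names, min_cells):
    """Keep only the genes detected in at least `min_cells` cells."""
    detected = detection_counts(matrix)
    keep = [j for j, n in enumerate(detected) if n >= min_cells]
    kept_names = [gene_names[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_matrix, kept_names


# The video uses a lower bound of 30 cells; the toy data only has 4 cells,
# so a lower bound of 2 is used here.
filtered, kept = filter_genes(counts, genes, min_cells=2)
print(kept)
```

With a real data set, the same function would be called with min_cells=30 to mirror the setting used in the Filter widget.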
Just for fun, let's estimate the distances between cells.
We'll use the Distances widget.
Set the distance metric to cosine,
because the number of genes or features is large,
and construct a hierarchical clustering.
Here it is.
The clustering looks very similar to the dendrogram the authors reported in the paper.
In fact, if we select the lower branch of the clustering tree
and check the IDs of the cells,
they indeed appear in the same cluster,
which is marked as Cluster A in the paper.
Our cluster contains cells D5,
D12, D18, all the way through D90.
The only difference is cell D3,
which is likely present
due to differences in gene filtering and clustering parameters.
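The cosine distance used for the clustering above can be sketched in a few lines of plain Python. This is only an illustration of the standard definition, one minus the cosine of the angle between two expression profiles; the cell names and values are made up, and Orange's own implementation may differ in details such as how it treats all-zero profiles.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equally long profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for an all-zero profile")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(profiles):
    """Return the symmetric matrix of pairwise cosine distances."""
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(profiles[i], profiles[j])
    return dist


# Made-up profiles: cell2 is cell1 scaled by two, cell3 expresses a different gene.
cells = {
    "cell1": [1.0, 0.0, 2.0],
    "cell2": [2.0, 0.0, 4.0],
    "cell3": [0.0, 3.0, 0.0],
}
dist = distance_matrix(list(cells.values()))
for name, row in zip(cells, dist):
    print(name, [round(d, 3) for d in row])
```

Because the cosine distance ignores the overall magnitude of a profile, cell1 and cell2 end up at a distance of essentially zero, while cell3, which shares no expressed genes with them, is at distance one. A hierarchical clustering would then be built on a matrix like this one.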
Orange can also load tab-delimited data.
We'll demonstrate this with Broad's Single Cell Portal.
Let's find the data on dividing neural cells
and download the expression data on neurons from the spinal cord.
This time the data comes in a zipped, tab-delimited format.
We'll use a new instance of the Load Data widget.
Clear any previously used files
and drag the zipped file to the widget.
We have 185 cells
and over 25,000 genes.
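To inspect such a file outside of Orange, a sketch like the following could be used. It relies only on the Python standard library and assumes a .zip archive holding a single tab-delimited text file with one header row and labels in the first column; the file name, labels, and values below are made up, and the helper read_zipped_tsv is hypothetical.

```python
import csv
import io
import zipfile


def read_zipped_tsv(zip_source, member=None):
    """Read a tab-delimited table from a zip archive.

    Assumes one header row and row labels in the first column.
    Returns (column_labels, row_labels, values).
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Default to the first file in the archive if none is named.
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = list(csv.reader(text, delimiter="\t"))
    header, body = rows[0], rows[1:]
    column_labels = header[1:]
    row_labels = [row[0] for row in body]
    values = [[float(x) for x in row[1:]] for row in body]
    return column_labels, row_labels, values


# Build a tiny in-memory archive so the sketch runs without any real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "expression.txt",
        "gene\tcell1\tcell2\nGapdh\t5.0\t0.0\nActb\t3.5\t1.2\n",
    )
buffer.seek(0)

columns, rows, values = read_zipped_tsv(buffer)
print(len(rows), "rows and", len(columns), "columns")
```

For a downloaded file, the path to the archive would be passed in place of the in-memory buffer. Whether rows hold genes or cells depends on the file, so it is worth checking the labels before going further.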
Just to check, here is a data table
and a t-SNE visualization.
Orange can also load other data formats, including loom,
but that's enough for now.
Let's explore what to do with all these data.
Check out our next videos in the series
for single cell data processing,
marker genes, clustering, and more.
