The Best Personalized Fundamental Stocks Scanner Setup (Part 1)
There are some good stock scanners, like Tradeview, that can filter stocks based on financial ratios, technical indicators, and fundamentals. What if you want a more tailor-made scanner? How about filtering all the stocks with a median 5-year Return on Equity (ROE) greater than 5%? It is not an easy task if you use online stock scanners. Neither does building your own (But at least possible).
In the United States, the U.S. SECURITIES AND EXCHANGE COMMISSION will upload financial statements data every quarter, thanks to the XBRL standard. You can find 10 years of data in this link. The only problem is it is quite difficult to understand the data structure and the meaning of those tags.
Let’s try to do an exercise to filter all stocks with a median 5-year return on equity (ROE) greater than 5%.
Step 1: Find the Tag for Net Income and Shareholders’ Equity
Instead of reading the messy taxonomy file, you can use Yeti Taxonomy Viewer to find the tag easily. Using the viewer, net income is NetIncomeLoss
and shareholders’ equity is StockholdersEquity
. Please also pay attention to the data type. Since we are interested in numeric data, the data type should be
. Also, for some items, the tag name may be different in different versions of the accounting standard. But from my experience, for some big items, they are consistent across different versions.
xbrli:monetaryItemType
Using the Yeti Taxonomy Viewer is quite intuitive. First, you need to register an account and login. Second, select the accounting standard. Third, just browse to the statement you normally would find the information and check the property:name
.
Step 2: Calculate the Median ROE Group by Stocks
I will now use SQL to query the data downloaded directly. As for how to do that, I will show you in Part 2. Basically, AWS S3, AWS Glue, and AWS Athena are used to make this possible.
Before showing you the SQL code, let me first show you an overview of the information contains in each table.
Table NameWhat is insideSUB (Submissions) [I name it as sub_tsv]Summary of submission. Basic information on the company.TAG (Tags) [I name it as tag_tsv]All standard taxonomy tags.NUM (Numbers) [I name it as num_tsv]All numeric data associated with submissions and tags.
For details on other tables, please refer to the documentation.
Getting 5-year NetIncomeLoss
Example Results:
adsh tag version ddate qtrs uom dimh iprx value footnote footlen dimn coreg durp datp dcml partition_0 0001640334-16-001065 NetIncomeLoss us-gaap/2015 20160131 4 USD 0x00000000 0 -2672412
0 0
0.01095891 0 0 2016q2 0001056285-19-000017 NetIncomeLoss us-gaap/2018 20190131 4 USD 0x00000000 0 3780000
0 0
0.021918058 -2 -3 2019q1 0001574135-19-000009 NetIncomeLoss us-gaap/2018 20190131 4 USD 0x00000000 0 -1.41903E+08
0 0
0.021918058 -1 -3 2019q1 0000751652-16-000410 NetIncomeLoss us-gaap/2015 20160630 4 USD 0x00000000 0 1.951E+08
0 0
0 0 -5 2016q3 0000109563-16-000283 NetIncomeLoss us-gaap/2015 20160630 4 USD 0x00000000 0 2.9577E+07
0 0
0 0 -3 2016q3
SQL Code
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'NetIncomeLoss'
AND dimh = '0x00000000'
AND qtrs = 4
AND iprx = 0
The
ddate
field contains the end date for the data value. I extract theYYYY
value from the date and make sure it is within the year 2015 to 2019.I extract the year value from
adsh
using Regex (The middle 2 digit number). Then make sure it is equal to the year value inddate
. The reason for doing so is that one report may contain more than 1 year’s data.I use
0x0000000
dimension data since otherdimh
values are used to link to other data points.I also make sure that
qtrs
, the count of the number of quarters represented by the data value is 4, which is yearly data.At last,
iprx
is set to0
because there may be duplicated value of the same tag in one report and0
represents the highest priority and precision.
Getting 5-year StockholdersEquity
The SQL code is similar to the above. Only qtrs
is set to 0
because StockholderEquity
is a point-in-time value. Please also noted that the values return contain both quarterly and yearly data. We will filter that out in the next step using sub_tsv
table.
Example Results
adsh tag version ddate qtrs uom dimh iprx value footnote footlen dimn coreg durp datp dcml partition_0 0001558891-17-000024 StockholdersEquity us-gaap/2016 20170331 0 USD 0x00000000 0 17403
0 0
0 0 0 2017q2 0000925645-17-000013 StockholdersEquity us-gaap/2015 20170331 0 USD 0x00000000 0 -1.16707E+08
0 0
0 0 -3 2017q2 0001564590-17-007150 StockholdersEquity us-gaap/2016 20170331 0 USD 0x00000000 0 6.93583E+08
0 0
0 0 -3 2017q2 0001562762-17-000048 StockholdersEquity us-gaap/2016 20170331 0 USD 0x00000000 0 2.78145E+08
0 0
0 0 -3 2017q2 0001171843-17-002300 StockholdersEquity us-gaap/2017 20170228 0 USD 0x00000000 0 2.719277E+09
0 0
0 3 -3 2017q2
SQL Code
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'StockholdersEquity'
AND dimh = '0x00000000'
AND qtrs = 0
AND iprx = 0
Step 3: Puting It Together
I used appl
data as a sample. There are 5-year StockholdersEquity
and NetIncomeLoss
.
adsh tag version ddate qtrs uom dimh iprx value footnote footlen dimn coreg durp datp dcml partition_0 adsh cik name fp 0000320193-19-000119 StockholdersEquity us-gaap/2019 20190930 0 USD 0x00000000 0 9.0488E+10
0 0
0 2 -6 2019q4 0000320193-19-000119 320193 APPLE INC FY 0001193125-15-356351 StockholdersEquity us-gaap/2015 20150930 0 USD 0x00000000 0 1.19355E+11
0 0
0 4 -6 2015q4 0001193125-15-356351 320193 APPLE INC FY 0000320193-19-000119 NetIncomeLoss us-gaap/2019 20190930 4 USD 0x00000000 0 5.5256E+10
0 0
0.021918058 2 -6 2019q4 0000320193-19-000119 320193 APPLE INC FY 0001193125-15-356351 NetIncomeLoss us-gaap/2015 20150930 4 USD 0x00000000 0 5.3394E+10
0 0
0.021918058 4 -6 2015q4 0001193125-15-356351 320193 APPLE INC FY 0000320193-17-000070 StockholdersEquity us-gaap/2017 20170930 0 USD 0x00000000 0 1.34047E+11
0 0
0 0 -6 2017q4 0000320193-17-000070 320193 APPLE INC FY 0000320193-17-000070 NetIncomeLoss us-gaap/2017 20170930 4 USD 0x00000000 0 4.8351E+10
0 0
-0.054793835 0 -6 2017q4 0000320193-17-000070 320193 APPLE INC FY 0000320193-18-000145 NetIncomeLoss us-gaap/2018 20180930 4 USD 0x00000000 0 5.9531E+10
0 0
0.021918058 1 -6 2018q4 0000320193-18-000145 320193 APPLE INC FY 0000320193-18-000145 StockholdersEquity us-gaap/2018 20180930 0 USD 0x00000000 0 1.07147E+11
0 0
0 1 -6 2018q4 0000320193-18-000145 320193 APPLE INC FY 0001628280-16-020309 StockholdersEquity us-gaap/2015 20160930 0 USD 0x00000000 0 1.28249E+11
0 0
0 6 -6 2016q4 0001628280-16-020309 320193 APPLE INC FY 0001628280-16-020309 NetIncomeLoss us-gaap/2015 20160930 4 USD 0x00000000 0 4.5687E+10
0 0
0.021918058 6 -6 2016q4 0001628280-16-020309 320193 APPLE INC FY
SQL Code
SELECT *
FROM
(SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'NetIncomeLoss'
AND dimh = '0x00000000'
AND qtrs = 4
AND iprx = 0
UNION
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'StockholdersEquity'
AND dimh = '0x00000000'
AND qtrs = 0
AND iprx = 0 ) AS "data"
JOIN
(SELECT adsh,
cik,
name,
fp
FROM sub_tsv ) AS "basic_info"
ON "data"."adsh" = "basic_info"."adsh"
WHERE "basic_info"."fp" = 'FY'
I use
union
to combine two table fromNetIncomeLoss
andStockholdersEquity
together.I
join
thesub_tsv
table onadsh
code to obtain information like companycik
code andname
.I filter
fp
byFY
(financial year) so that all records are yearly. This solves the problem inStockholdersEquity
that contaminated by quarterly values.
Step 3: Filter Stocks with Median 5-year ROE greater than 5%
Finally, I come up with this insane SQL
. The majority of the codes are duplicated from the above steps. Don’t scare by it.
What is does are
Create a pivot table with rows as
cik
andddate
and columns asNetIncomeLoss
andStockholdersEquity
Then calculate the ROE for each year by dividing
NetIncomeLoss
withStockholdersEquity
.Then calculate the median ROE group by
cik
which is unique for each company
SELECT "Return"."cik",
approx_percentile("Return"."Return",
0.5) as "ROE"
FROM
(SELECT *,
"pivot"."NetIncomeLoss"/"pivot"."StockholdersEquity" AS "Return"
FROM
(SELECT "basic_info"."cik",
"data"."ddate",
max(case
WHEN "data"."tag" = 'NetIncomeLoss' THEN
"data"."value"
ELSE 0 end) AS "NetIncomeLoss", max(case
WHEN "data"."tag" = 'StockholdersEquity' THEN
"data"."value"
ELSE 0 end) AS "StockholdersEquity"
FROM
(SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'NetIncomeLoss'
AND dimh = '0x00000000'
AND qtrs = 4
AND iprx = 0
UNION
SELECT *
FROM num_tsv
WHERE regexp_extract(cast(ddate AS varchar), '^([0-9]{4})', 1) IN ('2015','2016','2017','2018','2019')
AND regexp_extract(adsh, '^[0-9]{10}-([0-9]{2})-[0-9]{6}', 1) = regexp_extract(cast(ddate AS varchar), '^20([0-9]{2})', 1)
AND tag = 'StockholdersEquity'
AND dimh = '0x00000000'
AND qtrs = 0
AND iprx = 0 ) AS "data"
JOIN
(SELECT adsh,
cik,
name,
fp
FROM sub_tsv ) AS "basic_info"
ON "data"."adsh" = "basic_info"."adsh"
WHERE "basic_info"."fp" = 'FY'
GROUP BY "basic_info"."cik", "data"."ddate") AS "pivot" ) AS "Return"
GROUP BY "Return"."cik"
If you don’t follow the SQL code, it is perfectly normal since even myself will get loss. We can do this in Pandas
or create Views
as intermediate steps. That would be easier to understand.
Final results:
cikROE113505930.1527611761893378210020470.163996368003280523874841NaN468450.155445825307271865644720.065297320777548526357294NaN710104700.04060672229587468856135-Infinity915776030.0107457320.41289025728653544
Next, you can filter any stock with ROE
> 5%.
Next
In the following posts, I will show you how to scrap the fundamental data from SEC using Python!